The aim of this study was to test whether topic modelling (see What is Topic modelling) could accurately identify subject areas within a research field using metadata from published articles. After a search for ‘MDR-TB’ in the Web of Science Core Collection, data from 2101 articles were used for topic modelling and citation analysis.

We then combined visualizations with a review of the literature to label the topics that emerged from the computation as accurately as we could. We tried to keep the labels general while still making them meaningful enough to yield insights. A brief overview of the labelling process can be read here.

Figure 1: Percentage of labelled topics that make up the corpus

Strong emphasis on solving the MDR-TB problem by treatment of cases

Topic modelling revealed that a large proportion of research within the MDR-TB field focuses on improving treatment. The largest topic (topic 5), labelled ‘Treatment optimization’, consisted of smaller subtopics, of which the largest two were labelled ‘Treatment regimens’ and ‘Risk factors’. This topic can therefore be interpreted as exploring ways to optimise treatment by testing various drug combinations in new or existing regimens and by identifying social and physical risk factors that affect treatment outcome.

The second largest topic (topic 1) was labelled ‘Drug-related research’ and also consisted of subtopics, of which the most prominent were labelled ‘New compounds’ and ‘Resistance genes’. The label ‘Drug-related research’ is therefore intended to describe a topic that focuses on characterizing mechanisms of resistance at the molecular level in order to inform the synthesis of new drugs to combat resistance.

Together, the two topics Treatment optimization and Drug-related research make up 61% of the research output in MDR-TB, which suggests that the predominant strategy for solving the MDR-TB problem is drug treatment of existing cases.
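The percentages quoted here are shares of the corpus attributed to each topic. How those shares were computed is not described in this article, so the following is only a minimal sketch of one common approach: assign each article to its dominant topic (the one with the largest weight) and count. The function names and the toy weights are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def dominant_topic(weights):
    """Return the index of the topic with the largest weight for one article."""
    return max(range(len(weights)), key=lambda k: weights[k])


def topic_shares(doc_topic_weights):
    """Percentage of articles whose dominant topic is each topic index."""
    counts = Counter(dominant_topic(w) for w in doc_topic_weights)
    total = len(doc_topic_weights)
    return {topic: 100.0 * n / total for topic, n in sorted(counts.items())}


# Toy, made-up document-topic weights (rows: articles, columns: topics).
# These are NOT values from the MDR-TB corpus.
toy_weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

shares = topic_shares(toy_weights)
print(shares)
```

An alternative, equally plausible, convention is to average the topic weights over all articles instead of counting dominant topics; the two can give somewhat different percentages when many articles mix several topics.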

Knowledge gap on operational research/public health

Compared to the previous two, the next largest topic is visibly smaller at 16% of the research output. This topic (topic 2) was enriched with words such as case, control, country, need, strategy, global, report, cost, rate, system, estimate, model, transmission, burden and care, which together give the impression that the topic concerns public health. The articles found within this topic ranged from standard epidemiological studies using transmission models or cross-sectional surveys to qualitative and quantitative operational and implementation research studies. In contrast to the top two topics, this topic is distinguished by its emphasis on understanding the social and economic perspectives of MDR-TB in order to identify causes and barriers to care so that better policies can be designed and implemented.

Given the importance of obtaining valuable on-the-ground knowledge about MDR-TB interventions, it was surprising that this topic made up only a small fraction of MDR-TB research.

Few publications on diagnostics within MDR-TB research

The next topic (topic 3) was relatively well-defined as ‘Diagnostics’, based on the presence of characteristic words (e.g. assay, detection, rif, rapid, mods, sensitivity, dst, test, culture, specificity, mdrtbplus, genotype, xpert) and on papers confirming the topic content. It made up only 11% of MDR-TB research output. This search was centred on MDR-TB, and diagnostic research may be more abundant under the wider search term TB/tuberculosis. Even so, these results mirror the finding of WHO’s 2017 Policy Paper on Global Investments in TB R&D, where the Diagnostics field received only 9% of funding in the years 2009-2015.

This lack of investment in Diagnostics R&D can therefore be directly observed in the low research output on Diagnostics within the MDR-TB field.

Immunology of MDR-TB, a field for development?

The last two well-defined topics are similar in size and were labelled ‘Molecular typing’ (topic 0) and ‘Immunology’ (topic 6). Although only a small number of articles contained the words that made up these topics, those articles were specific and formed coherent groupings. This was particularly evident for the topic labelled Immunology. Despite the low number of words present in this topic, these words were specific to immunological concepts, such as cell, blood, ifn_gamma, immune, vaccine, plasma and il_2. The top articles in the topic, which discussed the differences and defects in the immune responses of MDR-TB patients, suggest that this is a coherent topic of research.

Unlike research efforts in TB immunology related to vaccine research and development, the immunology of MDR-TB is a subject area that has not yet boomed. It may represent a research area that could be developed further for its potential to yield immunological biomarkers, as some articles in this topic suggest that the immune responses to MDR-TB may be distinct from those to drug-susceptible TB.

Viewing topic distribution in a citation network

Topic distribution can also be seen visually through analysis of the MDR-TB citation network. Below, a force diagram algorithm lays out the citation network, with the topics mapped onto it by colour. The largest topic, Treatment optimization, can be seen in orange to the left, making up a large part of the network. At the opposite end of the network are Drug-related research (green), Diagnostics (yellow) and Molecular typing (magenta). The force algorithm places articles in proximity according to their similarity based on citation links. It is therefore interesting that Operational research/public health (blue) lies closer to Treatment optimization, which is perhaps not surprising given the importance of epidemiology and policy in determining treatment. It is also reassuring that Diagnostics is closely linked to Molecular typing and to Drug-related research, which includes the subtopic Resistance genes that forms the basis of any diagnostic development.
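The proximity of topics in a force layout ultimately reflects how many citation links run between their articles. As a rough numerical companion to the visual impression, one could count citation links per pair of topics. The sketch below is illustrative only: the function name, article identifiers and links are hypothetical, and it is not the analysis performed in this study.

```python
from collections import Counter


def topic_pair_link_counts(citations, topic_of):
    """Count citation links per unordered pair of topics.

    citations: iterable of (citing_article, cited_article) pairs.
    topic_of:  mapping from article id to its topic label.
    """
    counts = Counter()
    for citing, cited in citations:
        # Sort the two labels so (A, B) and (B, A) are counted together.
        pair = tuple(sorted((topic_of[citing], topic_of[cited])))
        counts[pair] += 1
    return counts


# Toy, made-up articles and citation links (not from the MDR-TB corpus).
toy_topic_of = {
    "a": "Diagnostics",
    "b": "Molecular typing",
    "c": "Diagnostics",
    "d": "Treatment optimization",
}
toy_citations = [("a", "b"), ("c", "b"), ("a", "c"), ("d", "a")]

link_counts = topic_pair_link_counts(toy_citations, toy_topic_of)
for pair, n in link_counts.most_common():
    print(pair, n)
```

Raw counts favour large topics, so in practice one would probably normalise by the number of articles in each topic before comparing pairs.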



Topic modelling is a useful tool for structuring the information held in a large database, such as an archive of research articles, in order to find patterns that were previously hidden from view. Using this method, research domains became evident, providing insights into the way the MDR-TB field has developed. Once topics have been identified, many further analyses become possible, such as investigating the relationship between topics and articles within a citation network or how topics have evolved over time.