(Image obtained from publication by Liu et al.)

Topic modelling is a method for analysing large volumes of non-annotated text. It is a text mining approach that uses statistical machine learning techniques to discover hidden semantic structures within a large body of text. By grouping together clusters of words that frequently occur together, a ‘topic’ emerges and can be defined.

How to use it to understand the evolution of scientific knowledge

Scientific knowledge is currently archived in the form of research articles, and the number of articles published each year has been increasing rapidly. A quick analysis on PubMed of articles published each year from 1970 to 2017 shows that the number of publications more than doubled between 2000 and 2017 alone (Figure 1).

Figure 1: Number of articles on PubMed over the years
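An analysis like the one behind Figure 1 can be scripted against the NCBI E-utilities `esearch` endpoint, which can return just the record count for a date-restricted query. The sketch below uses only the standard library; the query term and exact JSON shape are assumptions based on the public E-utilities interface, and fetching live counts needs network access (and should respect NCBI rate limits).

```python
# Sketch: yearly PubMed record counts via the NCBI E-utilities esearch
# endpoint. Live use requires network access; NCBI also asks heavy users
# to register and throttle requests.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_url(year):
    """Build an esearch URL that asks only for the record count for one year."""
    params = {
        "db": "pubmed",
        "term": f"{year}[pdat]",  # [pdat] restricts by publication date
        "rettype": "count",       # return the total count, not the ID list
        "retmode": "json",
    }
    return f"{EUTILS}?{urlencode(params)}"

def yearly_count(year):
    """Fetch the number of PubMed records published in a given year."""
    with urlopen(count_url(year)) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])

# Uncomment to fetch live counts (requires network access):
# for year in (2000, 2017):
#     print(year, yearly_count(year))
```

Looping this over 1970–2017 and plotting the counts reproduces the kind of growth curve shown in Figure 1.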

Identify trends in the evolution of research

Given this deluge of information, it is impossible to get an overview of all the different topics being researched without algorithms that can mine the data quickly and create such overviews for experts to analyse. This is particularly useful for those working in scientific management or policy, where identifying trends in the evolution of research is crucial for strategic decision-making. For example, topic modelling can be used to scan grant applications and map the topics to funding status. Alternatively, it could be used to identify areas in need of further development and growth.

Topic modelling for individual researchers

Topic modelling is also useful at the level of the individual researcher, though few have realized this yet. Imagine that you are a Principal Investigator in a field like bacterial cell development (my PhD topic) and you would like to know where your field is heading. Instead of relying on patchy information gathered from conferences or word of mouth from other investigators, topic modelling can reveal the areas that are trending, showing you where your research stands in comparison to the topics that are ‘hot’ in the field. Alternatively, you could look at the topics that emerge, identify a gap in the research, and focus on that instead to do truly original research. In the past, one way PhD students and postdocs decided which projects to focus on was by spending hours looking through research articles to identify a niche area where they could make a difference. Topic modelling has the potential to make this process far more efficient, and even to reveal areas where real impact could be made. If only I had had this tool in my early days as a PhD student!

How words within topics are different from author or journal keywords

When a paper is submitted and published, authors are asked to supply keywords that they consider descriptive of the article. Additionally, the journal’s editorial team also assigns keywords to the article for publication. These keywords can be used to create overviews or maps, and have been used in the field of scientometrics, or bibliographic research, to visualize how research areas group together and evolve over time.

Topic modelling, on the other hand, uses a bottom-up machine learning approach in which topics are not predefined by individuals but instead emerge organically from the data according to the words present within the abstracts. In this way, topic modelling avoids pre-existing definitions or biases about what a topic should look like (of course, bias can come into play when labelling the topics, so be aware of this caveat and try to minimise it). Oftentimes, authors do not realize that their choice of words in the abstract reveals underlying semantic patterns. This method attempts to uncover these hidden patterns in order to define topics of research based on word groupings in the text.

Topic modelling or keyword clustering?

Both approaches are useful and can complement each other. In theory, they can be used to confirm one another: one would expect topics that emerge from topic modelling to map well onto clusters of keywords assigned by authors or journals.

In the work for MDR-TB, preference is placed on topic modelling as the first approach, as it allows structure to emerge from the data in as objective a way as possible, followed by other approaches such as keyword mapping to confirm the findings from the topic modelling.

Playing with the topics

Once topics are defined, many different types of analyses can be carried out to reveal further insights. For example, these are a few questions we could now proceed to analyse:

  • Which authors are associated with which topics? This tells us who the key players in the field are and maps them to the topics.
  • Which institutions are most active in certain topics? Are certain places more likely to work on certain topics than others? If you are looking to collaborate with specific researchers on a topic, this could be very useful information to have.
  • How do the topics evolve over time? Are certain topics increasing or decreasing in popularity? If so, perhaps one could analyse this further by conducting more extensive research. This is also a way to quickly identify a research gap, by comparing the need in the field against the actual research output over time.
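The last question, topic evolution over time, can be answered directly from a fitted model's document–topic weights. The sketch below assumes invented weights and publication years; in practice the weights would come from something like `LatentDirichletAllocation.transform` and the years from the articles' metadata.

```python
# Sketch: tracking topic prevalence over time. doc_topic[d, k] is the
# weight of topic k in document d (e.g. from an LDA model's transform);
# both the weights and the years below are invented for illustration.
import numpy as np

doc_topic = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
])
years = np.array([2000, 2000, 2017, 2017])

# Average topic weights per year: a topic whose mean weight grows
# across years is gaining attention in the literature.
for year in np.unique(years):
    mean_weights = doc_topic[years == year].mean(axis=0)
    print(year, np.round(mean_weights, 2))
```

In this toy example the first topic's mean weight falls from 0.85 to 0.2 between the two years while the second rises, which is exactly the kind of trend signal the question above is after.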

We are not limited to carrying out these analyses separately; some of these questions can be combined in a single visualization to obtain further insights. Topic modelling can therefore serve as a launch pad for other investigations.