*This article first appeared in Medium’s Towards Data Science magazine on April 3, 2020.*

On February 10th, STAT+ published an interesting article on the fluctuating interest in, and funding for, coronavirus research (see this link). Reading it inspired me to look further into the state of coronavirus research, specifically to understand the landscape up to the end of 2019, right before the COVID-19 outbreak took off worldwide.

In this essay, I will share some preliminary explorations I made to obtain an overview of the research field, using a text mining approach.

For those who are not familiar with text mining, this website provides a simple introduction to the technique. The point of text mining is to transform unstructured text data into information that provides valuable insights. In this investigation, the algorithm used by the VOSviewer application scans thousands of abstracts and places the frequently used terms (words) on a two-dimensional map, where the distance between two terms indicates how related they are. The more often two terms occur together in the same abstract, the closer together they are placed, and eventually clusters of terms may be seen.

This technique is a powerful way to understand the discourse within a field, as clustering of words indicates a topic of discussion or research within the field. This way, research themes can be detected emerging from the data. Without such a computational method, it is very difficult (if not impossible) to obtain a structure from the massive amount of information generated by research in modern times.
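To make the co-occurrence idea concrete, here is a minimal sketch of the counting step that sits underneath a term map. It is not the VOSviewer implementation: the toy abstracts and the function name `cooccurrence_counts` are my own illustrative assumptions, and the layout step that turns counts into distances on a map is left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy abstracts, already reduced to lists of terms.
abstracts = [
    ["spike protein", "receptor", "replication"],
    ["spike protein", "receptor", "mouse"],
    ["transmission", "surveillance", "pneumonia"],
    ["transmission", "pneumonia", "spike protein"],
]


def cooccurrence_counts(docs):
    """Count, for every pair of terms, how many documents contain both."""
    pair_counts = Counter()
    for terms in docs:
        # set() ignores repeats within one abstract; sorted() gives each
        # pair one canonical order, e.g. ("receptor", "spike protein").
        for pair in combinations(sorted(set(terms)), 2):
            pair_counts[pair] += 1
    return pair_counts


pairs = cooccurrence_counts(abstracts)
for (a, b), n in pairs.most_common(2):
    print(f"{a} + {b}: {n} shared abstracts")
```

Pairs with higher counts would be drawn closer together on the map, which is how groups of related terms end up forming visible clusters.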

Constructing a term map of coronavirus research

Metadata from 11,224 articles, downloaded from Lens.org using a combination of coronavirus search terms and dated from 1 January 1990 to 31 December 2019, were first used to generate a term network map in VOSviewer. The data and network were then refined further using OpenRefine and Gephi. Multiple iterations were made before the final network was settled upon.

In the term map in Figure 1, five different clusters have been generated, a modularity level I thought to be reasonable for this early exploration. Also interesting is the structure of the network, which reveals clusters on opposite sides, with a bridging cluster between them.

The size of the nodes (colored circles) can be set according to various parameters. For the visualization in Figure 1, the most commonly used terms were emphasized, where the larger nodes denote a higher occurrence of the word. Labels are shown only for the most frequent terms used in the corpus (body of words from the collection of articles) in order to not overload the visual.

Figure 1: A term co-occurrence map from the titles and abstracts of articles on coronavirus research from 1990 to 2019. After reconciliation of duplicate terms and experimenting with different modularities and cut-offs, 450 nodes and 3036 edges were settled upon for the final visualization. The sizes of the nodes were set to the number of term occurrences, so the larger the node, the more frequent the term. Only words that appeared a minimum of 20 times (i.e. in at least 20 articles) were included in the analysis. Binary counting was used, based on the presence or absence of a term in the abstract, and not the full number of occurrences within an abstract. Data: Lens.org.
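The difference between binary counting and full counting, and the effect of a minimum-occurrence cut-off, can be sketched as follows. This is an illustration under my own assumptions, not VOSviewer's code: the toy abstracts are made up, and the threshold `MIN_DOCS` is lowered from the 20 used for Figure 1 to 2 so that the small example still keeps some terms.

```python
from collections import Counter

MIN_DOCS = 2  # Figure 1 used 20; lowered here so the toy data survives

# Hypothetical abstracts as token lists; "virus" repeats inside one abstract.
abstracts = [
    ["virus", "virus", "virus", "mouse"],
    ["virus", "bat"],
    ["mouse", "bat", "bat"],
    ["pcr"],
]


def binary_counts(docs):
    """Number of documents in which each term appears at least once."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # set() collapses repeats in a document
    return counts


def full_counts(docs):
    """Total number of times each term appears across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


binary = binary_counts(abstracts)
full = full_counts(abstracts)

# Keep only terms that appear in at least MIN_DOCS documents.
kept = sorted(term for term, n in binary.items() if n >= MIN_DOCS)

print("binary:", dict(binary))
print("full:  ", dict(full))
print("kept:  ", kept)
```

With binary counting, one abstract that repeats a word many times cannot inflate that word's node size, which is why "virus" counts as 2 here rather than 4.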

We could also ask if there are terms of particular importance within the network, beyond frequency. Citation counts have been commonly used to measure the importance of a research work. It can therefore be assumed that terms appearing in highly cited articles come from work deemed significant by other researchers in the field.

Figure 2: Comparing a term map based on term occurrence (A) to average normalized citation (B). Data: Lens.org.

Figure 2 compares a term map based on occurrence (left, A) with one based on citation (right, B). Note that many words with a high occurrence (A) shrink significantly in the citation map (B), while terms that were uncommon in the occurrence map (A) gain prominence in the citation map (B), as seen by the change in node sizes. We will dive deeper into these different term map views when describing the themes below, using them to uncover insights within each cluster.
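For readers wondering what "average normalized citation" might look like in practice, here is a minimal sketch. It assumes a definition in which each article's citation count is divided by the mean citation count of all articles published in the same year, and a term's score is the mean of those normalized values over the articles containing it. The records and the function name are hypothetical, and the exact calculation behind Figure 2 may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (publication year, citation count, terms in abstract)
articles = [
    (2003, 100, {"sars", "mouse"}),
    (2003, 300, {"sars", "cathepsin l"}),
    (2015, 10, {"mers", "mouse"}),
    (2015, 30, {"mers", "dpp4"}),
]


def average_normalized_citations(records):
    """Mean year-normalized citation score of the articles using each term."""
    # Step 1: mean citation count per publication year.
    by_year = defaultdict(list)
    for year, cites, _ in records:
        by_year[year].append(cites)
    year_mean = {year: mean(values) for year, values in by_year.items()}

    # Step 2: normalize each article by its year's mean, then collect the
    # normalized scores under every term the article contains.
    per_term = defaultdict(list)
    for year, cites, terms in records:
        normalized = cites / year_mean[year] if year_mean[year] else 0.0
        for term in terms:
            per_term[term].append(normalized)

    # Step 3: average the scores for each term.
    return {term: mean(scores) for term, scores in per_term.items()}


scores = average_normalized_citations(articles)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.2f}")
```

Normalizing by year keeps older articles, which have had more time to accumulate citations, from dominating. In this toy data "mouse" occurs as often as "sars" but scores lower, because it appears only in the less cited article of each year.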

Themes within coronavirus research

To identify the themes, I examined the terms prevalent in each cluster and did some additional research on a few of them to demonstrate how this approach could be used to generate insights. In my experience, labelling a topic is not an exact science, and requires some knowledge of the field to be meaningful. I relied upon my background in clinical infectious diseases and vaccine research to help me in this process, though input from experts working in the coronavirus research field would help to make the labelling more accurate.

Cluster 1: Molecular and cellular biology of coronaviruses

Figure 3A shows the most commonly used terms within cluster 1. Many of these terms implicate basic studies in coronavirus biology such as spike protein, replication, mechanism, structure, genome and rna.

Figure 3: Visualizing the highest ranking terms in cluster 1 based on term occurrence (A) or average normalized citation (B). Data: Lens.org.

An interesting term, mhv, can also be found among the most frequently occurring terms in this cluster. What is mhv?

Before the SARS coronavirus was discovered, much of the research on coronaviruses was conducted using the mouse hepatitis virus (MHV), a murine coronavirus which, depending on its tropism, can infect the respiratory tract, gastrointestinal tract or the nervous system (see this review). The neurotropic MHV strains have been used to create a mouse model of multiple sclerosis due to their ability to provoke demyelination. The MHV model is therefore the foundation for much of our understanding of coronavirus infection biology today. Given the recent finding that the SARS-CoV-2 virus provokes anosmia (loss of smell), it might be worthwhile to revisit some of the older literature on MHV and its effects on the central nervous system.

Changing the view to average normalized citation count (Figure 3B) causes new terms such as cathepsin L and syncytium formation to stand out, suggesting that these terms are present in highly cited articles.

Digging deeper, Cathepsin L turns out to be an important molecule, as it is a host enzyme used by the SARS virus to activate its fusion into the host cell using its spike protein (see this article). Inhibitors of Cathepsin L have been shown to prevent the entry of the SARS coronavirus (article), though to my knowledge this strategy is not currently being considered in the treatment of COVID-19.

Syncytium formation refers to the ability of coronaviruses such as those causing SARS and MERS to mediate cell-to-cell fusion between infected cells and adjacent uninfected cells, a strategy that enables the virus to spread directly between cells and avoid neutralising antibodies (article).

Cluster 2: Public health and clinical research on coronavirus outbreaks

Unlike in cluster 1, there is a tendency here to cluster words implicating clinical and public health research (Figure 4A). Words such as transmission, laboratory, surveillance and epidemiology implicate a public health approach to investigating coronavirus outbreaks. Other terms like pneumonia, fever, symptoms and illness suggest a clinical approach. Many people are by now already familiar with the term mers, which stands for Middle East Respiratory Syndrome, the most recent coronavirus outbreak in humans.

Figure 4: Visualizing the highest ranking terms in cluster 2 based on term occurrence (A) or average normalized citation (B). Data: Lens.org.

To find out the most significant terms in this cluster, the normalized citation view is used (Figure 4B). Bat coronavirus now appears prominently in this cluster, due to its similarity to the SARS and MERS coronaviruses and high subsequent clinical importance. It is closely situated with the terms bat species and natural reservoir which together point to studies to understand the ecology of bat coronaviruses in the wild.

Another significant term, renal failure, prompted further inquiry. In SARS patients, acute renal failure was found to be a highly unfavourable outcome, though relatively uncommon at 6.7% (article). Concerningly, renal abnormalities have now been observed in COVID-19 patients according to two pre-print articles from China (this article and this article). This example highlights how a term map based on citations could bring our attention to insights that may otherwise be lost in the sea of information.

Cluster 3: Coronavirus infections in animals

This interesting cluster resides between the basic science and the clinical research clusters, bridging the groups (Figure 1). On closer inspection, this grouping is enriched with terms about coronavirus diseases in animals such as bovine coronavirus and feline coronavirus, as well as tgev, which is a coronavirus that infects pigs (Figure 5A). The term ibv refers to infectious bronchitis virus, which causes a highly infectious coronavirus disease in chickens. Not visible within the current view, but present within this cluster at lower algorithmic thresholds, were other terms such as equine coronavirus and canine coronavirus, revealing the striking variety of animal hosts that coronaviruses have adapted to. This also brings to the foreground the fact, little known among the public, that all coronaviruses originate from animals, eventually adapting to humans and causing disease.

Figure 5: Visualizing the highest ranking terms in cluster 3 based on term occurrence (A) or average normalized citation (B). Data: Lens.org.

Highly prevalent words such as pcr and sample reflect the methodology used to identify and characterize these viruses but, because they are ubiquitous, may not necessarily be characteristic of this cluster in particular. This can be seen in the normalized citation view (Figure 5B), where these terms diminish in size considerably.

Within this view, the terms pedv infection and porcine delta cov stood out. These coronaviruses cause severe disease in neonatal piglets, with a high mortality rate and major economic losses to farmers, estimated at $1 billion during the first outbreak in 2013 (article).

Cluster 4: Viral respiratory infections

The cluster of terms describing other respiratory viruses such as influenza, adenovirus, rsv and rhinovirus suggests that coronaviruses are also researched in the context of viral respiratory infections (Figure 6A). Common and non-specific terms such as year and month suggest descriptions of the timing of these infections. Less common terms such as nl63, hku1 and oc43 turn out to be coronaviruses that infect humans and generally cause upper respiratory tract infections presenting as the common cold. It is interesting to note that the HKU1 coronavirus originated from infected mice, whereas OC43 is thought to have jumped to humans from cattle.

Figure 6: Visualizing the highest ranking terms in cluster 4 based on term occurrence (A) or average normalized citation (B). Data: Lens.org.

In Figure 6B, the normalized citation view allows us to observe terms such as clinical data, nasopharyngeal aspirate, viral cause and odds ratio becoming prominent. This hints that research in this cluster revolves around clinical studies to identify and monitor viral infections of the respiratory tract.

Cluster 5: The immune response and viral pathogenesis in the mouse model

This small cluster, which appears as an offshoot of cluster 1, reveals many insights on closer examination. Mouse is a major term, which suggests that the mouse model is central to this theme. Within the mouse model, various angles appear to be investigated, such as the immune response, characterized by words such as ifn (interferon), t cell, macrophage, epitope, subunit vaccine and proinflammatory cytokine, and its consequences, as seen in lung pathology and the central nervous system (Figure 7A). Receptor binding domain hints at another direction relating to the study of coronavirus entry, which becomes more evident when we look for terms that are present in highly cited articles using the normalized citation view in Figure 7B.

Figure 7: Visualizing the highest ranking terms in cluster 5 based on term occurrence (A) or average normalized citation (B). Data: Lens.org.

In Figure 7B, the words mers cov spike, dipeptidyl peptidase, dpp4 and hdpp4 become more prominent, suggesting that these terms have been used in papers that have received many citations. dpp4, also known as dipeptidyl peptidase 4, is a human protein discovered to be the receptor used by the MERS coronavirus spike protein to infect host cells (article). Joining the dots, we can piece together that mers cov spike binding to its receptor, dpp4, is perhaps being studied in the context of the mouse model. Indeed, this is confirmed by studies that describe the development of a transgenic mouse expressing the human dpp4 gene that is fully permissive to infection by the MERS coronavirus (article). The appearance of the term hdpp4 (human dpp4) is explained by the fact that mouse dpp4 does not bind the mers cov spike protein well, which necessitated the development of a mouse strain expressing the human version of the dpp4 gene.

When are the terms most utilized?

As a final exploration, I wanted to obtain an idea of the underlying dynamics of field evolution over time. To explore this, the term map in Figure 1 based on occurrence was modified, where the nodes were coloured according to the average year of usage (Figure 8).

Figure 8: Visualizing the frequency of terms within research articles according to average year. Red = most recent, blue = oldest. Data: Lens.org.

The visual provides us with a snapshot in time of when the terms were most popular. Looking at this visual, there appears to be a general gradient in the colours where the newest terms are depicted in red, while the oldest terms are in blue. We can see that most of the terms within the basic science and animal research clusters to the left and middle of the network are depicted in bluish shades, compared to terms within the clinical clusters to the right of the network which appear reddish.

This suggests that much of recent research in the field has been skewed towards clinical research, as opposed to basic science and animal coronavirus research.
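The "average year of usage" behind the colouring in Figure 8 can be sketched as the mean publication year of the articles in which a term occurs. The snippet below is a minimal illustration under that assumption; the records and the function name `average_year_per_term` are hypothetical, not taken from the tools used for the figure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (publication year, terms found in the abstract)
articles = [
    (1995, {"mhv", "mouse"}),
    (2004, {"sars", "mouse"}),
    (2014, {"mers", "transmission"}),
    (2018, {"mers", "transmission", "pedv"}),
]


def average_year_per_term(records):
    """Mean publication year of the articles that contain each term."""
    years = defaultdict(list)
    for year, terms in records:
        for term in terms:
            years[term].append(year)
    return {term: mean(values) for term, values in years.items()}


avg_year = average_year_per_term(articles)

# Oldest terms first, i.e. from the "blue" end to the "red" end of the scale.
for term, year in sorted(avg_year.items(), key=lambda kv: kv[1]):
    print(f"{term}: {year:.1f}")
```

An average like this is only a rough summary: a term used steadily for thirty years and a term used in one burst in the middle of the period can end up with the same colour.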

While not in the scope of this essay, following up on terms that are trending in recent years (i.e. those in red) could be worth doing and may reveal emerging areas of research.

Final thoughts

The term map exploration revealed the structure of the coronavirus research field over the last three decades. We learn that coronavirus research has been broadly divided into three major areas: basic science research, clinical/public health research and animal coronavirus research. Subdivisions within basic science and clinical research were also observed, such as that focusing on the mouse model and on viral respiratory infections. Using different views based on occurrence or citation revealed the importance of different terms according to how commonly used they were, such as mouse, or how important they were, such as bat coronavirus.

The data also hint that there has been more focus on clinical and public health research than on basic science over the last few years. Similarly, there appears to be less recent activity in animal coronavirus research, with the exception of porcine delta coronavirus and porcine epidemic diarrhea virus, due to their resurgence in recent times. This impression could be followed up using citation analysis, which can reveal the evolution of the field more clearly.

These preliminary findings point to a knowledge gap that could be filled within the basic science and animal health research areas in the coming years. This may be particularly important given that animal coronaviruses are ubiquitous in nature and the jump from animals to humans is always a possibility due to ecological disruptions and possibly due to intensive animal farming. This would be an interesting line of enquiry to pursue.

Also, given our urgent need for new vaccines, I noted the paucity of terms relating to vaccine and immunological research in the field. Whether this is simply an impression or a genuine gap needs to be ascertained with further analysis, as it may have implications for the speed at which new coronavirus vaccines can realistically be developed.

The term map analysis enabled the structure of the research field to reveal itself from the data in a ‘bottom up’ approach. As powerful as it is, this approach is not without its limitations.

Firstly, constructing the current map relied upon the algorithm built into the VOSviewer application, which uses fixed criteria to calculate the links between terms. Exploring the connections between terms more deeply will require further experimentation with alternative link calculations, which may reveal relationships not present in this term map analysis.
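As an illustration of what an alternative link calculation could look like, the sketch below compares raw co-occurrence counts with two common normalizations: the association strength (co-occurrences divided by the product of the two terms' occurrences), which to my understanding is the kind of normalization VOSviewer applies by default, and the Jaccard index. All counts are hypothetical and are not taken from the coronavirus dataset.

```python
def association_strength(c_ij, o_i, o_j):
    """Co-occurrences relative to the product of the terms' occurrences."""
    return c_ij / (o_i * o_j)


def jaccard(c_ij, o_i, o_j):
    """Shared documents divided by documents containing either term."""
    return c_ij / (o_i + o_j - c_ij)


# Hypothetical counts:
# pair -> (co-occurrences, occurrences of first term, occurrences of second)
pairs = {
    ("dpp4", "hdpp4"): (20, 25, 30),
    ("virus", "sample"): (200, 2000, 1000),
}

for (a, b), (c_ij, o_i, o_j) in pairs.items():
    print(
        f"{a} / {b}: raw={c_ij}, "
        f"association={association_strength(c_ij, o_i, o_j):.5f}, "
        f"jaccard={jaccard(c_ij, o_i, o_j):.3f}"
    )
```

By raw count the two very common terms look more strongly linked, but both normalized measures rank the rarer, more specific pair higher. Different measures weight common and rare terms differently, which is why switching the link calculation can change which relationships stand out.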

Secondly, the insights presented in this essay are only a fraction of the potential information that can be obtained from the map, and therefore cannot be considered a complete description of the field. Displaying the map relied on various cut-offs and parameters, which means that the data can be sliced and diced in different ways to reveal new insights. This is both an opportunity for new discovery and a possible source of subjective interpretation.

With this in mind, it is important to realize that such a method is not meant as a substitute for deeper studies on the subject, but rather as a way of ‘augmenting intelligence’ on a topic within a relatively short amount of time. Combining this approach with reading in-depth reviews by leading experts like Stanley Perlman (such as this) has been my way of quickly gaining an understanding of the field.

In the grip of a major pandemic, much research is focused upon the creation of new vaccines and treatments. Yet in order to devise these solutions in a short span of time, we need to build on a body of knowledge and avoid re-inventing the wheel. Used appropriately, new tools from data science and artificial intelligence can help us structure the information and direct our resources to where they matter the most. Bringing more of these insights to the forefront will be the focus of my subsequent explorations on this topic.

Acknowledgements: Thanks to MA3X for assistance with data processing