Improving influenza modelling from search query logs

UPDATE: Flu Detector is a real-time tool that uses a refined version of the methodology described below to estimate flu rates in England based on Google search or Twitter content.

This blog post introduces joint work with Andrew Miller from Harvard University as well as Christian Stefansen and Steve Crossan from Google, where we propose a set of improvements to models for nowcasting influenza rates from search engine query logs. The outcomes of this research have been published in Scientific Reports.

You probably know of Google Flu Trends (GFT). It was a platform that displayed weekly estimates of influenza-like illness (ILI) rates around the world. These estimates were the product of a statistical model that mapped the frequency of several search queries to official health surveillance reports. GFT was an important tool because it complemented traditional epidemiological surveillance schemes while reaching a larger part of the population. Furthermore, GFT's estimates were shown to be more timely and, in general, such methods can potentially be used in locations with less advanced or even nonexistent healthcare systems.

The original GFT model (10.1038/nature07634) expands on ideas presented in 10.1086/593098 and 10.2196/jmir.1157. It is a simple method, applied however to a massive volume of data. The algorithm has two phases: a) the selection of a small subset of search queries using an unusually involved correlation analysis, and b) the training of a basic (least squares) regression model that predicts ILI rates from the selected queries. More specifically, the GFT model proposes that $$\text{ILI} = qw + \beta \, ,$$ where $q$ represents the aggregate frequency of the selected queries for a week, and $w$ and $\beta$ are the regression weight and intercept respectively, both learned during model training.1
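To make the formula concrete, here is a minimal Python sketch of such a single-variable model. It works in logit space, as mentioned in the footnote, and fits $w$ and $\beta$ by ordinary least squares. The function names and the weekly numbers are made up for illustration; this is not the actual GFT implementation.

```python
import math

def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_univariate(q, ili):
    """Least-squares fit of logit(ILI) = w * logit(q) + beta."""
    x = [logit(v) for v in q]
    y = [logit(v) for v in ili]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    w = sxy / sxx
    beta = mean_y - w * mean_x
    return w, beta

def predict(q_week, w, beta):
    """Estimate the ILI rate for one week and map it back from logit space."""
    return inv_logit(w * logit(q_week) + beta)

# Made-up weekly values: aggregate query frequency and reported ILI rate.
q = [0.002, 0.004, 0.008, 0.012, 0.006]
ili = [0.010, 0.018, 0.035, 0.050, 0.027]

w, beta = fit_univariate(q, ili)
estimate = predict(0.010, w, beta)
```

Because all selected queries are summed into a single $q$, every query contributes with the same weight, which is one of the limitations discussed next.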

This algorithm was definitely a good start, but not good enough. Several publications documented errors, some minor and some major, in the ILI estimates it produced (10.1371/journal.pone.0023610, 10.1371/journal.pcbi.1003256, 10.1126/science.1248506).

What can be improved?

In a paper published today, we try to understand more rigorously the possible pitfalls and performance limitations of the original GFT algorithm, and we propose improvements.2 Starting with query selection, we no longer separate it from the regression modelling. Instead, we apply a regularised regression technique known as the Elastic Net (10.1111/j.1467-9868.2005.00503.x). This makes the model more expressive, since different queries can have different weights, and simultaneously performs query selection by encouraging sparsity (zero weights). The regression model now becomes $$\text{ILI} = \sum_{i=1}^{n} q_i w_i + \beta \, ,$$ where $q_i$ denotes a query frequency and $w_i$ its respective weight (many of which can be equal to 0).3
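The way the Elastic Net combines weighting and selection can be illustrated with a small NumPy sketch based on coordinate descent. The penalty parametrisation (`alpha`, `l1_ratio`), the synthetic data and the function names are assumptions made for this example, not the settings or code used in the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw - beta||^2 + alpha*l1_ratio*||w||_1
                              + 0.5*alpha*(1 - l1_ratio)*||w||^2."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring absorbs the intercept
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j added back in.
            r_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio))
    beta = y_mean - x_mean @ w
    return w, beta

# Synthetic stand-in for query frequencies: only the first two "queries" matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.3 + 0.05 * rng.normal(size=200)

w, beta = elastic_net(X, y, alpha=0.1, l1_ratio=0.9)
selected = np.flatnonzero(w)   # indices of queries with a non-zero weight
```

The L1 part of the penalty is what sets the weights of irrelevant queries to exactly zero, while the L2 part keeps the solution stable when queries are strongly correlated with each other.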

Looking at the relationship between the time series of ILI rates and search query frequencies, we realised that nonlinearities may be present. See below, for example, what we got for the query sore throat remedies (denoted by $x_q$) with or without a logit space transformation.4

These nonlinearities, together with the desire to group related queries that may reflect different flu- or disease-related concepts, led us to propose a nonlinear Gaussian Process (GP) regression model applied on top of query clusters. The queries entering the GP model were the ones pre-selected by the Elastic Net. A different squared exponential GP kernel (or covariance function) was applied to each query cluster (check the paper for more details).
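As a rough illustration of this idea, the sketch below builds a covariance function as a sum of squared exponential kernels, one per query cluster, and computes the GP posterior mean. The kernel hyperparameters are fixed by hand here, whereas a real model would learn them from data. The cluster assignment, the noise level and the synthetic data are all assumptions for the example.

```python
import numpy as np

def se_kernel(A, B, sigma, length):
    """Squared exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return sigma ** 2 * np.exp(-d2 / (2.0 * length ** 2))

def clustered_kernel(A, B, clusters, sigma=1.0, length=1.0):
    """Sum of SE kernels, each one seeing only the columns of one cluster."""
    return sum(se_kernel(A[:, idx], B[:, idx], sigma, length)
               for idx in clusters)

def gp_predict(X_train, y_train, X_test, clusters, noise=0.1):
    """Posterior mean of a GP with the clustered kernel and Gaussian noise."""
    y_mean = y_train.mean()
    K = clustered_kernel(X_train, X_train, clusters)
    K += noise ** 2 * np.eye(len(X_train))
    K_star = clustered_kernel(X_test, X_train, clusters)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    return y_mean + K_star @ alpha

# Synthetic example: 4 "queries" grouped into 2 hypothetical clusters.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(60, 4))
y = np.sin(X[:, 0] + X[:, 1]) + np.tanh(X[:, 2] - X[:, 3])
clusters = [[0, 1], [2, 3]]

# Train on the first 50 "weeks" and estimate the remaining 10.
pred = gp_predict(X[:50], y[:50], X[50:], clusters)
```

Summing one kernel per cluster means that each cluster contributes its own nonlinear function to the estimate, which is how related queries get grouped.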

Below, you can see the estimated ILI rates in the US for several years by the investigated regression models (GFT, Elastic Net, GP) compared to weekly ILI reports published by the CDC. Evidently, the GP model yields the best estimates.

To quantify the improvement, the mean absolute percentage error across 5 flu seasons (2008-2013) for the GFT, Elastic Net and GP models was 20.4%, 11.9% and 10.8% respectively. When it really mattered, i.e. during the high flu circulation weeks in those 5 seasons, the respective error rates were 24.8%, 15.8% and 11%, indicating an error increase for the linear GFT and Elastic Net models, but stable performance for the nonlinear GP model.
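For reference, the error metric quoted above can be computed as in the following sketch, which uses the standard definition of the mean absolute percentage error. The weekly rates are made up for illustration and are not taken from the paper.

```python
def mape(truth, estimates):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(e - t) / t for t, e in zip(truth, estimates)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up weekly ILI rates (%): official reports vs. model estimates.
cdc = [1.2, 2.5, 4.0, 3.1]
model = [1.0, 2.8, 3.6, 3.1]

error = mape(cdc, model)   # roughly 9.67
```

Since each error is divided by the reported rate, a given absolute miss counts for more in low-activity weeks than in peak weeks.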

What went wrong with the previous model?

In the figure above, you can see that the original GFT model does not perform well on many occasions. Subfigure C holds a clear example of that. Focusing on that period of time, we realised that the GFT model was using queries with a dubious or nonexistent relation to flu. The top ones in terms of percentage of impact on the ILI estimates were rsv (24.5% of impact), flu symptoms (17.5%), benzonatate (6.2%), symptoms of pneumonia (6%) and upper respiratory infection (3.9%). Clearly the model was aggregating queries about different health conditions, which led to an over-prediction of ILI rates.


Regression models can be further improved by injecting an autoregressive component, which embeds previous ILI rates once they become available (CDC reports usually come with a 2-week lag). With autoregression, the error in our GP-based ILI estimates drops from 10.8% to 7.3%. The figure below presents this performance improvement visually. However, keep in mind that autoregression also biases our model's outputs towards the running health surveillance estimates, which cannot be assumed to always be 100% error free.
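The basic idea of combining a query-based estimate with delayed surveillance data can be sketched as a simple lagged regression. Note that this is only a simplified stand-in: the autoregressive formulation in the paper is more elaborate, and the function names, coefficients and data below are assumptions made for illustration.

```python
import numpy as np

def fit_with_lag(query_estimates, ili, lag=2):
    """Least-squares fit of
    ili[t] = a * query_estimates[t] + b * ili[t - lag] + c."""
    X = np.column_stack([
        query_estimates[lag:],       # query-based estimate for week t
        ili[:-lag],                  # official rate from `lag` weeks earlier
        np.ones(len(ili) - lag),     # intercept
    ])
    coef, *_ = np.linalg.lstsq(X, ili[lag:], rcond=None)
    return coef

def predict_with_lag(coef, query_estimate_now, ili_lagged):
    """Combine the current query-based estimate with the latest known ILI rate."""
    return coef[0] * query_estimate_now + coef[1] * ili_lagged + coef[2]

# Synthetic series that follows the assumed relation exactly.
rng = np.random.default_rng(1)
query_est = rng.uniform(1.0, 5.0, size=60)
ili = np.empty(60)
ili[:2] = [2.0, 2.2]
for t in range(2, 60):
    ili[t] = 0.6 * query_est[t] + 0.3 * ili[t - 2] + 0.1

coef = fit_with_lag(query_est, ili, lag=2)
nowcast = predict_with_lag(coef, query_estimate_now=3.0, ili_lagged=ili[-2])
```

The larger the weight on the lagged official rate, the more the nowcast inherits any errors in the surveillance data, which is the bias mentioned above.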

The future

Briefly, I see two paths for future research. One should bring the various user-generated data sources (e.g. social media, search query logs, mobile phone logs) together and the other should partly detach this modelling from supervised learning techniques, creating solid complementary disease indicators, independent from the various biases or mistakes that traditional health surveillance schemes may be making.


1, 3, 4: Note that to keep the notation in this blog post simple, I have omitted the important step of transforming $q$ or $q_i$ and, as a consequence, the predicted ILI rate into logit space. Note also that this transformation is reversed when an estimate is made (see the paper for a more thorough explanation).

2: The modelling improvements proposed in our paper have not been launched on the most recent and, unfortunately, final version of the GFT website.


Vasileios Lampos, Andrew C. Miller, Steve Crossan and Christian Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5, Article Number: 12760, 2015. doi:10.1038/srep12760