This blog post introduces collaborative work with Andrew Miller from Harvard University and Christian Stefansen and Steve Crossan from Google, in which we propose a set of improvements to models for nowcasting influenza rates from search engine query logs. The outcomes of this research have been published in Nature Scientific Reports.

You probably know of Google Flu Trends (GFT). It was a platform that displayed weekly estimates of influenza-like illness (ILI) rates all around the world. These estimates were the product of a statistical model that mapped the frequency of several search queries to official health surveillance reports. GFT was an important tool because it constituted a complementary indicator to traditional epidemiological surveillance, one with much broader penetration in the population. Furthermore, GFT's estimates were shown to be more timely, and, in general, such methods can potentially be used in locations with less advanced or even nonexistent healthcare systems.

The original GFT model (10.1038/nature07634) expands on ideas presented in 10.1086/593098 and 10.2196/jmir.1157. It is a simple method applied, however, to a massive volume of data. The algorithm has two phases: a) the selection of a small subset of search queries using a fairly involved correlation analysis, and b) the training of a basic (least squares) regression model that predicts ILI rates from the selected queries. More specifically, the GFT model proposes that $$\text{ILI} = qw + \beta \, ,$$ where $q$ represents the aggregate frequency of the selected queries for a week, and $w$, $\beta$ are the regression weight and intercept parameters respectively, learned during model training.
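For concreteness, here is a minimal sketch of this single-predictor least-squares fit; the arrays below are made-up stand-ins for the weekly aggregate query frequency and the matching official ILI rates.

```python
import numpy as np

# Hypothetical data: weekly aggregate frequency of the selected queries (q)
# and the corresponding official ILI rates (ili).
q = np.array([0.012, 0.019, 0.031, 0.044, 0.038, 0.021])
ili = np.array([1.1, 1.7, 2.9, 4.2, 3.6, 1.9])

# Least-squares fit of ILI = q * w + beta.
A = np.column_stack([q, np.ones_like(q)])
w, beta = np.linalg.lstsq(A, ili, rcond=None)[0]

# Nowcast for a new week's aggregate query frequency.
ili_estimate = 0.027 * w + beta
print(f"w = {w:.2f}, beta = {beta:.2f}, estimate = {ili_estimate:.2f}")
```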

This algorithm was definitely a good start, but not good enough. Several publications documented minor and major errors in the ILI estimates it produced (10.1371/journal.pone.0023610, 10.1371/journal.pcbi.1003256, 10.1126/science.1248506).

What can be improved?

In a paper published today, we make an effort to understand more rigorously the possible pitfalls and performance limitations of the original GFT algorithm, and we propose improvements. Starting with query selection, we no longer separate it from the regression modelling. To do so, we apply a regularised regression technique known as the Elastic Net (10.1111/j.1467-9868.2005.00503.x). This enables more expressiveness, since different queries can have different weights, and it simultaneously performs query selection by encouraging sparsity (zero weights). The regression model now becomes $$\text{ILI} = \sum_{i=1}^{n} q_i w_i + \beta \, ,$$ where $q_i$ denotes a query frequency and $w_i$ its respective weight (many of which can be equal to 0).
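A hedged sketch of this step, using scikit-learn's ElasticNetCV on synthetic data standing in for the query frequency matrix; the paper's actual feature set and hyperparameter choices are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)

# Hypothetical data: rows are weeks, columns are individual query frequencies.
X = rng.random((150, 400))          # 400 candidate queries
ili = 2.0 * X[:, 0] + 1.5 * X[:, 3] + 0.1 * rng.standard_normal(150)

# Elastic Net: the L1 penalty drives many weights to exactly zero (query
# selection), the L2 penalty keeps correlated queries grouped; the
# regularisation strength is picked by cross-validation.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, ili)

selected = np.flatnonzero(model.coef_)
print(f"{selected.size} of {X.shape[1]} queries kept:", selected[:10])
```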

Looking at the relationship between the time series of ILI rates and search query frequencies, we realised that nonlinearities may be present. See below, for example, what we obtained for the query sore throat remedies (denoted by $x_q$) with and without a logit space transformation.
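The transformation itself is straightforward; a small sketch follows, with hypothetical frequencies, and an epsilon guard that is our own addition for numerical safety.

```python
import numpy as np

def logit(x, eps=1e-8):
    """Map a frequency in (0, 1) to logit space."""
    x = np.clip(x, eps, 1.0 - eps)   # guard against exact zeros and ones
    return np.log(x / (1.0 - x))

# Hypothetical weekly frequencies of "sore throat remedies".
x_q = np.array([1e-5, 4e-5, 9e-5, 3e-5])
print(logit(x_q))
```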

These nonlinearities, together with the desired property of grouping related queries that may reflect different flu or disease related concepts, led us to propose a nonlinear Gaussian Process (GP) regression model applied on top of query clusters. The queries entering the GP model were the ones pre-selected by the Elastic Net. A different squared exponential GP kernel (or covariance function) was applied to each query cluster (check the paper for more details).
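To illustrate the composite kernel idea, here is a simplified numpy sketch of GP regression with one squared exponential kernel per query cluster. The cluster assignments, hyperparameters and data are all made up, and unlike the paper we keep the hyperparameters fixed rather than optimising the marginal likelihood.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale, variance):
    """Squared exponential kernel on a subset of columns."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def cluster_kernel(A, B, clusters, lengthscales, variances):
    """Sum one squared exponential kernel per query cluster."""
    return sum(rbf_kernel(A[:, idx], B[:, idx], l, v)
               for idx, l, v in zip(clusters, lengthscales, variances))

# Hypothetical setup: 6 queries grouped into 2 clusters.
rng = np.random.default_rng(1)
X = rng.random((80, 6))                        # weekly query frequencies
y = np.sin(X[:, :3].sum(1)) + 0.05 * rng.standard_normal(80)
clusters = [np.arange(0, 3), np.arange(3, 6)]
ls, var, noise = [1.0, 1.0], [1.0, 1.0], 0.01

# Standard GP predictive mean with the composite kernel.
K = cluster_kernel(X, X, clusters, ls, var) + noise * np.eye(len(X))
X_new = rng.random((5, 6))
K_s = cluster_kernel(X_new, X, clusters, ls, var)
mean = K_s @ np.linalg.solve(K, y)
print(mean)
```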

Below, you can see the ILI rates in the US estimated over several years by the investigated regression models (GFT, Elastic Net, GP), compared to the weekly ILI reports published by the CDC. Evidently, the GP model yields the best estimates.

To quantify this, the mean absolute percentage error (MAPE) across 5 flu seasons (2008-2013) for the GFT, Elastic Net and GP models was 20.4%, 11.9% and 10.8% respectively. When it really mattered, i.e. during the high flu circulation weeks in those 5 seasons, the respective error rates were 24.8%, 15.8% and 11%, indicating an error increase for the linear GFT and Elastic Net models, but stable performance for the nonlinear GP model.
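For reference, this error metric is computed as follows (shown here only to make the reported figures unambiguous):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mape([2.0, 4.0], [1.8, 4.4]))  # -> 10.0
```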

Can user-generated Internet content be used to assess the impact of a health intervention? In a new paper published in Data Mining and Knowledge Discovery, we propose a method for estimating the impact of a vaccination program for influenza based on social media content (Twitter) and search query data (Bing). The work was done in collaboration with Public Health England and Microsoft Research, was funded by the interdisciplinary project i-sense, and will be presented at the journal track of ECML PKDD 2015 in September.

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.
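A much-simplified numerical sketch of the three steps (control selection, historical fitting, counterfactual projection) follows. It uses a linear least-squares map and random placeholder series, whereas the paper's model is nonlinear and driven by Bing and Twitter data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weekly disease-rate estimates (from the Internet-data model)
# for the vaccinated area and for candidate control areas.
weeks_pre, weeks_post = 100, 30
target = rng.random(weeks_pre + weeks_post)
controls = rng.random((weeks_pre + weeks_post, 20))

# 1) Pick control areas that correlate with the target pre-intervention.
corr = [np.corrcoef(target[:weeks_pre], controls[:weeks_pre, j])[0, 1]
        for j in range(controls.shape[1])]
top = np.argsort(corr)[-5:]

# 2) Fit a linear map from the controls to the target on pre-intervention weeks.
A = np.column_stack([controls[:weeks_pre, top], np.ones(weeks_pre)])
coef = np.linalg.lstsq(A, target[:weeks_pre], rcond=None)[0]

# 3) Project the counterfactual (no-campaign) rates after the intervention
#    and read the impact off the relative difference.
A_post = np.column_stack([controls[weeks_pre:, top], np.ones(weeks_post)])
counterfactual = A_post @ coef
impact = (counterfactual - target[weeks_pre:]).mean() / counterfactual.mean()
print(f"estimated relative reduction: {impact:.1%}")
```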


Vasileios Lampos, Elad Yom-Tov, Richard Pebody and Ingemar J. Cox. Assessing the impact of a health intervention via user-generated Internet content. Data Mining and Knowledge Discovery 29(5), pp. 1434-1457, 2015. doi: 10.1007/s10618-015-0427-9
Paper | Supplementary Material

In our ACL '15 paper "An analysis of the user occupational class through Twitter content," co-authored with Daniel Preoţiuc-Pietro and Nikolaos Aletras, we explore how social media information can be used to infer the occupational class of users. We base our analysis on the Standard Occupational Classification from the Office for National Statistics in the UK, which comprises 9 broad categories of occupations.

The investigated methods take advantage of a user's textual input as well as platform-oriented characteristics (interaction, impact, usage). The best performing methodology uses a neural clustering technique (spectral clustering on neural word embeddings) and a Gaussian Process model to conduct the classification. It delivers 52.7% accuracy in predicting a user's occupational class, a strong performance for a 9-way classification task (a random guess would be correct roughly 11% of the time).
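As an illustration of the clustering step, the sketch below runs spectral clustering over a random placeholder for the word embedding matrix and then builds a per-user cluster-frequency feature vector; the downstream Gaussian Process classifier from the paper is not reproduced here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(3)

# Hypothetical stand-in for neural word embeddings (vocabulary x dimensions).
embeddings = rng.standard_normal((1000, 50))

# Spectral clustering on an embedding-similarity graph yields word clusters
# ("topics") whose per-user frequencies become features for the classifier.
clustering = SpectralClustering(n_clusters=30, affinity="nearest_neighbors",
                                random_state=0)
labels = clustering.fit_predict(embeddings)

# A user's feature vector: proportion of their words falling in each cluster.
user_word_ids = rng.integers(0, 1000, size=200)   # hypothetical token ids
features = np.bincount(labels[user_word_ids], minlength=30) / 200
print(features[:10])
```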

Our qualitative analysis confirms the general hypothesis of occupational class separation as indicated by language use across the different job categories. This can be due to a different topical focus, e.g. artists will talk about art, but also due to more generic behaviours, e.g. lower-ranked occupational classes tend to use more elongated words, whereas higher-ranked occupations tend to discuss topics such as politics or education more.

We are also making the data from the paper available (README).

D. Preoţiuc-Pietro, V. Lampos and N. Aletras. An analysis of the user occupational class through Twitter content. ACL '15, pp. 1754-1764, 2015.

Here's a snapshot taken from our paper "Predicting and Characterising User Impact on Twitter" that will be presented at EACL '14.

In this work, we tried to predict the impact of Twitter accounts based on actions under their direct control (e.g. posting quantity and frequency, numbers of @-mentions or @-replies and so on), including textual features such as word or topic (word-cluster) frequencies. Given the decent inference performance, we then dug further into our models and qualitatively analysed their properties from a variety of angles, in an effort to discover the specific user behaviours that are decisive for impact.
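One way to obtain such feature relevances, sketched here on synthetic data, is a GP regressor with an ARD-style kernel (one length scale per feature): after training, features with short learned length scales are the ones the model relies on most. This illustrates the general technique, not the paper's exact model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

# Hypothetical per-account features under the user's direct control:
# [tweets per day, unique @-mentions, links posted, days active].
X = rng.random((300, 4))
impact = 3.0 * X[:, 1] + 1.0 * X[:, 0] + 0.1 * rng.standard_normal(300)

# ARD-style kernel: one length scale per feature; short length scales
# after training flag the most relevant features.
kernel = RBF(length_scale=np.ones(4)) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, impact)

rbf = gp.kernel_.k1
print("learned length scales:", np.round(rbf.length_scale, 2))
```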

In this figure, for example, we have plotted the impact score distribution for accounts with low (L) and high (H) participation in certain behaviours, compared to the average impact score across all users in our sample (red line). The chosen user behaviours, i.e. the total number of tweets, the numbers of @-replies, links and unique @-mentions, and the number of days with nonzero tweets, were among the most relevant for predicting an account's impact score.
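For intuition, here is a toy recreation of that kind of comparison on made-up data, splitting users into low and high participation groups by quartile and contrasting each group's mean with the overall mean (the red line in the figure):

```python
import numpy as np

rng = np.random.default_rng(5)
unique_mentions = rng.poisson(40, 500)       # hypothetical per-user counts
impact = np.log1p(unique_mentions) + rng.standard_normal(500) * 0.3

# Low (L) and high (H) participation groups by quartile of the behaviour.
lo, hi = np.quantile(unique_mentions, [0.25, 0.75])
print("overall mean:", impact.mean().round(2))
print("L group mean:", impact[unique_mentions <= lo].mean().round(2))
print("H group mean:", impact[unique_mentions >= hi].mean().round(2))
```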

In a nutshell, based on our experimental process, we concluded that activity, interactivity (especially when interacting with a broader set of accounts rather than a specific clique) and engagement on a diverse set of topics (with the most prevalent themes of discussion being politics and showbiz) play a significant role in determining the impact of an account. For a more detailed analysis, however, you'll need to read the actual paper.

How to read the figure: the second subfigure from the top indicates that users with very few unique @-mentions (those who mention a very limited set of other accounts) have an impact score approximately identical to the average (red line), whereas users with a high number of unique @-mentions tend to be distributed across impact scores distinctly higher than the average.

