Google Flu Trends modelling revisited

This blog post introduces a collaborative work with Andrew Miller from Harvard University as well as Christian Stefansen and Steve Crossan from Google, where we propose a set of improvements to the models behind Google Flu Trends. The outcomes of this research are published in Nature Scientific Reports.

You probably know of Google Flu Trends (GFT). It is a platform that displays weekly estimates of influenza-like illness (ILI) rates all around the world. Those estimates are the product of a statistical model that maps the frequency of several search queries to official health surveillance reports. GFT is an important tool because it complements traditional epidemiological schemes while reaching a larger part of the population. Furthermore, GFT's estimates can be more timely, and such methods can potentially be used in locations with less advanced or even nonexistent healthcare systems.

The original GFT model (10.1038/nature07634) expands on ideas presented in 10.1086/593098 or 10.2196/jmir.1157. It is a simple method, applied however to a massive volume of data. The algorithm has two phases: a) the selection of a small subset of search queries using a more involved than usual correlation analysis, and b) the training of a basic (least squares) regression model that predicts ILI rates using the selected queries. More specifically, the GFT model proposes that $$\text{ILI } = qw + \beta \, ,$$ where $q$ represents the aggregate frequency of the selected queries for a week, and $w$, $\beta$ are the regression weight and intercept parameters respectively, both learned during model training.
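As a rough illustration of the second phase, the single-predictor least squares fit in the equation above can be sketched in plain Python. The function name `fit_simple_regression` and the weekly values are made up for this example; this is a minimal sketch, not the actual GFT code.

```python
def fit_simple_regression(q, ili):
    """Ordinary least squares fit of ili ~ q * w + beta for one predictor."""
    n = len(q)
    mean_q = sum(q) / n
    mean_ili = sum(ili) / n
    # Slope: covariance of (q, ili) divided by the variance of q.
    cov = sum((a - mean_q) * (b - mean_ili) for a, b in zip(q, ili))
    var = sum((a - mean_q) ** 2 for a in q)
    w = cov / var
    # Intercept: makes the fitted line pass through the point of means.
    beta = mean_ili - w * mean_q
    return w, beta


# Hypothetical weekly aggregate query frequencies and ILI rates (illustration only).
q = [0.010, 0.020, 0.030, 0.040]
ili = [1.5, 2.5, 3.5, 4.5]
w, beta = fit_simple_regression(q, ili)
estimate = 0.025 * w + beta  # ILI estimate for a week with query frequency 0.025
```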

This algorithm was definitely a good start, but not good enough. Several publications showcased some of the minor or major errors in ILI estimates made during its application (10.1371/journal.pone.0023610, 10.1371/journal.pcbi.1003256, 10.1126/science.1248506).

What can be improved?

In a paper published today, we make an effort to understand more rigorously the possible pitfalls and performance limitations of the original GFT algorithm, and we propose improvements. Starting from the query selection method, we no longer separate it from the regression modelling. To do so, we apply a regularised regression technique known as the Elastic Net (10.1111/j.1467-9868.2005.00503.x). This enables more expressiveness, since different queries can have different weights, and simultaneously performs query selection by encouraging sparsity (zero weights). The regression model now becomes $$\text{ILI} = \sum_{i=1}^{n} q_i w_i + \beta \, ,$$ where $q_i$ denotes a query frequency and $w_i$ its respective weight (many of which can be equal to 0).
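To make the sparsity effect concrete, here is a minimal sketch of Elastic Net regression fitted by coordinate descent in plain Python. The objective uses one commonly used parameterisation (`alpha` for the overall penalty strength, `l1_ratio` for the L1/L2 mix); these names, the toy data and the solver are assumptions made for illustration and are not taken from the paper.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Minimise (1/2n)||y - Xw - beta||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||_2^2 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    beta = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = beta + sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1.0 - l1_ratio))
        # The intercept is not penalised: set it to the mean residual.
        beta = sum(y[i] - sum(X[i][k] * w[k] for k in range(p)) for i in range(n)) / n
    return w, beta


# Toy data: three "queries"; only the first one is strongly related to the target.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.01, 2.99, -0.99, -1.01]
weights, intercept = elastic_net(X, y)
```

In this toy run the weakly related second and third features receive a weight of exactly zero, which is the query selection behaviour described above.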

Looking at the relationship between the time series of ILI rates and search query frequencies, we realised that nonlinearities may be present. See below, for example, what we got for the query sore throat remedies (denoted by $x_q$) with or without a logit space transformation.
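For reference, the logit transformation maps a proportion in (0, 1) to the whole real line. A minimal Python sketch follows; the example frequencies are invented, and how zero frequencies are handled in practice is not covered here.

```python
import math


def logit(x):
    """Log-odds of a proportion x, defined for 0 < x < 1."""
    if not 0.0 < x < 1.0:
        raise ValueError("logit is only defined for values strictly between 0 and 1")
    return math.log(x / (1.0 - x))


# Hypothetical weekly frequencies of a single query (fractions of all searches).
frequencies = [0.0001, 0.0005, 0.002]
logit_frequencies = [logit(f) for f in frequencies]
```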

These nonlinearities, together with the desire to group related queries that may reflect different flu or disease related concepts, led us to propose a nonlinear Gaussian Process (GP) regression model applied on top of query clusters. The queries entering the GP model were the ones pre-selected by the Elastic Net. A different squared exponential GP kernel (or covariance function) was applied on top of each query cluster (check the paper for more details).
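The idea of one squared exponential kernel per query cluster can be sketched as follows. This sketch assumes that the per-cluster kernels are combined by summation, which the post does not state; the cluster indices and hyperparameters are also invented for illustration.

```python
import math


def se_kernel(x, z, sigma, lengthscale):
    """Squared exponential covariance between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return sigma ** 2 * math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def clustered_kernel(x, z, clusters, params):
    """Sum of SE kernels, each one restricted to the queries of one cluster.

    clusters: list of index lists, one per query cluster (hypothetical grouping).
    params:   list of (sigma, lengthscale) pairs, one per cluster.
    """
    total = 0.0
    for indices, (sigma, lengthscale) in zip(clusters, params):
        x_c = [x[i] for i in indices]
        z_c = [z[i] for i in indices]
        total += se_kernel(x_c, z_c, sigma, lengthscale)
    return total


# Two weeks described by four query frequencies, split into two clusters.
week_a = [0.2, 0.4, 0.1, 0.3]
week_b = [0.3, 0.1, 0.2, 0.5]
clusters = [[0, 1], [2, 3]]
params = [(1.0, 1.0), (2.0, 0.5)]
k_ab = clustered_kernel(week_a, week_b, clusters, params)
```

Giving each cluster its own variance and lengthscale lets the model treat groups of related queries differently, instead of forcing one smoothness assumption on all of them.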

Below, you can see the estimated ILI rates in the US for several years by the investigated regression models (GFT, Elastic Net, GP) compared to weekly ILI reports published by the CDC. Evidently, the GP model yields the best estimates.

To put numbers on the improvement, the mean absolute percentage error across 5 flu seasons (2008-2013) for the GFT, Elastic Net and GP models was equal to 20.4%, 11.9% and 10.8% respectively. When it really mattered, i.e. during the high flu circulation weeks in those 5 seasons, the respective error rates were equal to 24.8%, 15.8% and 11%, indicating an error increase for the linear GFT and Elastic Net models, but stable performance for the nonlinear GP model.
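For readers unfamiliar with the metric, the mean absolute percentage error averages the absolute error of each weekly estimate relative to the reported rate. A small sketch with made-up numbers:

```python
def mape(actual, estimated):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    ratios = [abs(e - a) / abs(a) for a, e in zip(actual, estimated)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical reported and estimated ILI rates for two weeks.
error = mape([2.0, 4.0], [1.0, 5.0])  # errors of 50% and 25% average to 37.5%
```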

Assessing the impact of a health intervention via user-generated Internet content

Can user-generated Internet content be used to assess the impact of a health intervention? In a new paper published in Data Mining and Knowledge Discovery, we propose a method for estimating the impact of a vaccination program for influenza based on social media content (Twitter) and search query data (Bing). The work has been done in collaboration with Public Health England and Microsoft Research, was funded by the interdisciplinary project i-sense and will be presented at the journal track of ECML PKDD 2015 in September.

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.
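The projection step can be illustrated with a deliberately simplified sketch: learn a linear relationship between control and target areas before the campaign, project what the target rates would have been during the campaign, and compare with what was observed. The linear mapping, the function names and the numbers are all assumptions made for illustration; the paper's actual procedure is more involved.

```python
def fit_line(x, y):
    """Least squares fit of y ~ w * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    w = cov / var
    return w, mean_y - w * mean_x


def estimate_impact(control_pre, target_pre, control_post, target_post):
    """Absolute and relative difference between observed and projected rates."""
    # Relationship between control and target areas before the intervention.
    w, b = fit_line(control_pre, target_pre)
    # Projected target rates had there been no intervention.
    projected = [w * c + b for c in control_post]
    mean_projected = sum(projected) / len(projected)
    mean_observed = sum(target_post) / len(target_post)
    delta = mean_observed - mean_projected
    return delta, delta / mean_projected


# Hypothetical disease rates inferred from Internet data (illustration only).
delta, relative = estimate_impact(
    control_pre=[1.0, 2.0, 3.0, 4.0],
    target_pre=[2.0, 4.0, 6.0, 8.0],
    control_post=[5.0, 6.0],
    target_post=[8.0, 10.0],
)
```

A negative `delta` here means the observed rates in the target areas were lower than the projection.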

Vasileios Lampos, Elad Yom-Tov, Richard Pebody and Ingemar J. Cox. Assessing the impact of a health intervention via user-generated Internet content. Data Mining and Knowledge Discovery, 2015. doi: 10.1007/s10618-015-0427-9

An analysis of the user occupational class through Twitter content

In our ACL '15 paper "An analysis of the user occupational class through Twitter content" (co-authored with Daniel Preoţiuc-Pietro and Nikolaos Aletras), we explore the dynamics of social media information in the task of inferring the occupational class of users. We base our analysis on the Standard Occupational Classification from the Office for National Statistics in the UK, which comprises 9 broad categories of occupations.

The investigated methods take advantage of the user's textual input as well as platform-oriented characteristics (interaction, impact, usage). The best performing methodology uses a neural clustering technique (spectral clustering on neural word embeddings) and a Gaussian Process model for conducting the classification. It delivers a 52.7% accuracy in predicting the user's occupational class, a very decent performance for a 9-way classification task.

Our qualitative analysis confirms the general hypothesis that occupational classes are separable by the language used in the different job categories. This can be due to a different topical focus, e.g. artists will talk about art, but also due to more generic behaviours, e.g. the lower-ranked occupational classes tend to use more elongated words, whereas higher-ranked occupations tend to discuss Politics or Education more.

We are also making the data of the paper available (README).

D. Preoţiuc-Pietro, V. Lampos and N. Aletras. An analysis of the user occupational class through Twitter content. ACL '15, pp. 1754-1764, 2015.

Predicting and characterising user impact on Twitter

How to read this figure: The second subfigure from the top indicates that users with very few unique @-mentions (those who mention a very limited set of other accounts) have an impact score approximately equal to the average (red line), whereas users with a high number of unique @-mentions tend to be distributed across impact scores distinctly higher than the average.

Here's a snapshot taken from our paper "Predicting and Characterising User Impact on Twitter" that will be presented at EACL '14.

In this work, we tried to predict the impact of Twitter accounts based on actions under their direct control (e.g. posting quantity and frequency, numbers of @-mentions or @-replies and so on), including textual features such as word or topic (word-cluster) frequencies. Given the decent inference performance, we then dug further into our models and qualitatively analysed their properties from a variety of angles in an effort to discover the specific user behaviours that are decisive impact-wise.

In this figure, for example, we have plotted the impact score distribution for accounts with low (L) and high (H) participation in certain behaviours compared to the average impact score across all users in our sample (red line). The chosen user behaviours, i.e. the total number of tweets, the numbers of @-replies, links and unique @-mentions, and the days with nonzero tweets, were among the most relevant ones for predicting an account's impact score.

In a nutshell, based on our experimental process, we concluded that activity, interactivity (especially when interacting with a broader set of accounts rather than a specific clique) and engagement on a diverse set of topics (with the most prevalent themes of discussion being politics and showbiz) play a significant role in determining the impact of an account. For a more detailed analysis, however, you'll need to read the actual paper.

Emotions in books reflect economic misery

Fig. 1(a): Pearson's r between Literary Misery (LM) and Economic Misery (EM) for various smoothing windows, using a lagged version or a moving average over the past years.

Fig. 1(b): Literary Misery (LM) versus a moving average of the Economic Misery (EM) using the past 11 years (t=11).

In previous works, we have investigated emotion signals in English books (PLOS ONE, 2013) as well as the robustness of such signals under various metrics and statistical tests (Big Data '13).

Extending our previous research, we now show how emotions in books could correlate with systemic factors, such as the status of an economy (PLOS ONE, 2014). In our main experiment, we use a composite economic index that represents unemployment and inflation through the years, called Economic Misery (EM), and correlate it against a Literary Misery index (LM) that represents the composite emotion of Sadness minus Joy in books. We observe the best correlations when EM is averaged over the past decade (see Fig. 1(a) & 1(b)); correlations increase for the period from 1929 (Great Depression) onwards. Interestingly, we get very similar results for books written in American English, British English and German when compared to their local EM indices (i.e. for the US, UK and Germany respectively). For more methodological details, a better presentation of all the results and an interesting discussion, where we argue that causation may be the reason behind this correlation, I have to point you to the actual paper.
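The core computation, correlating LM against a moving average of EM over past years, can be sketched in a few lines of Python. The yearly values below are invented, and the exact window definition (here the current year plus the preceding `window - 1` years) is an assumption rather than the paper's specification.

```python
import math


def trailing_mean(series, window):
    """Mean of each value together with the (window - 1) values before it."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical yearly indices (illustration only).
em = [10.0, 12.0, 14.0, 13.0, 11.0, 9.0]
lm = [0.1, 0.2, 0.4, 0.5, 0.45, 0.3]
window = 3
em_smoothed = trailing_mean(em, window)
lm_aligned = lm[window - 1:]  # drop the years without a full window of history
r = pearson_r(em_smoothed, lm_aligned)
```

Repeating this for a range of window sizes gives the kind of curve plotted in Fig. 1(a).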

Press Release: University of Bristol
Media Coverage: The Guardian, New York Times, Independent, The Conversation

Bentley A.R., Acerbi A., Ormerod P. and Lampos V. Books average previous decade of economic misery. PLOS ONE, 2014.

Predicting voting intention from Twitter

Many past research efforts have tried to exploit human-generated content posted on Social Media platforms to predict the result of an election [1,2] or of various sociopolitical polls, including the ones targeting voting intentions [3,4]. Most papers on the topic, however, received waves of criticism as their methods failed to generalise when applied to different case studies. For example, a paper [5] showed that prior approaches [1,3] did not predict the result of the US congressional elections in 2009.

One may also spot various other shortcomings of these approaches. Firstly, the modelling procedure is often biased towards specific sentiment analysis tools [6,7]. As a result, the rich textual information is compressed into only a handful of features expressing different sentiment or affective types. Apart from their ambiguity and overall moderate performance, those tools are also language-dependent to a significant extent, and on most occasions machine-translating them creates problematic outputs. Furthermore, Social Media content filtering is performed using handcrafted lists of task-related terms (e.g., names of politicians or political parties) despite the obvious fact that the keywords of interest will change as new political players or events come into play. Most importantly, methods mainly focus on the textual content alone, without making any particular effort to model individual users or to jointly account for both words and user impact. Finally, predictions are one-dimensional, meaning that one variable (a party or a specific opinion poll variable) is modelled each time. However, in a scenario where political entities are competing with each other, it would make sense to use a multi-task learning approach that incorporates all political players in one shared model.

As part of our latest research, we propose a method for text regression tasks capable of modelling both word frequencies and user impact. Thus, we are now in a position to filter (select) and weigh not only task-related words, but also task-related users. The general setting is supervised learning performed via regularised regression that, in turn, favours sparsity: from tens of thousands of words and users in a Social Media data set, we select compact subsets that ideally will be related to the task at hand.
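One way to picture a model that weighs both users and words is a bilinear form, where each user gets a weight, each word gets a weight, and the prediction sums the word frequencies of every user scaled by both. The sketch below assumes this bilinear form for illustration only; the names and numbers are hypothetical and the post itself does not spell out the model.

```python
def bilinear_predict(Q, user_weights, word_weights, bias):
    """Prediction u^T Q w + bias, where Q[i][j] is how often user i used word j."""
    total = bias
    for i, u_i in enumerate(user_weights):
        if u_i == 0.0:
            continue  # a zero weight means this user has been filtered out
        total += u_i * sum(q_ij * w_j for q_ij, w_j in zip(Q[i], word_weights))
    return total


# Two users, two words; sparsity keeps only the first user and the second word.
Q = [[1.0, 2.0], [3.0, 4.0]]
user_weights = [1.0, 0.0]
word_weights = [0.0, 2.0]
prediction = bilinear_predict(Q, user_weights, word_weights, bias=0.5)
```

With sparse weights on both sides, only the selected users' use of the selected words contributes to the estimate.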

Emotions in English books

This drawing (HQ version) shows a result from our research published today in PLOS ONE. It actually is a simple plot depicting the difference between the emotions of Joy and Sadness in approx. 5 million books published in the 20th century (1900-2000). You may have noticed that the peak for Sadness — and equivalently the minimum level of Joy — occurred during the World War II period.

However, for people who despise simplicity, we have also included some more elaborate (and perhaps more interesting) results in our paper. I'm not going to repeat them here, of course, as it makes no sense — that's the purpose of the publication!

If you feel that reading an academic paper is a painful task (even for this easy-to-read, made-for-humans paper), then you might find the following press releases useful: PLoS ONE, University of Bristol or University of Sheffield, Nature.

Acerbi A., Lampos V., Garnett P., Bentley R.A. (2013) The Expression of Emotions in 20th Century Books. PLOS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030

Flu Detector in the news

Our latest work, where influenza-like illness rates are predicted from the content of Twitter, has been featured in mainstream media about technology and science.

BBC Radio 4, Costing the Earth (March 21, 2012)
by Martin Poyntz-Roberts

Examiner, Brits study forecasting flu outbreaks using Twitter (December 27, 2011)
by Linda Chalmer Zemel

New Scientist, Engines of the future: The cyber crystal ball (November 17, 2010)
by Phil McKenna

MIT Technology Review, How Twitter Could Better Predict Disease Outbreaks (July 14, 2010)
by Christopher Mims

Press releases
University of Bristol News, Could social media be used to detect disease outbreaks? (November 1, 2011)
University of Bristol, Computer Science Department News, Predicting Flu from the content of Twitter (July 1, 2010)

Lampos and Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.
Lampos, De Bie and Cristianini. Flu Detector - Tracking Epidemics on Twitter. ECML/PKDD '10.
Lampos and Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP'10.

LaTeX beamer how-to

Why not produce a presentation using a WYSIWYG editor (like MS Powerpoint)? For the same reasons you wouldn't write your publications or books in MS Word. Coding your presentation in LaTeX with the 'help' of the beamer class makes the task easy, especially if you already have some experience with TeX scripting.

Beamer is a LaTeX class that provides many easy-to-deploy commands, tricks, templates and libraries for producing presentations. I searched the web for an easy-to-follow example of its use, but the process wasn't very straightforward, so when I finally compiled a LaTeX presentation, I thought it might be useful to make the source code publicly available. Note that the attached documents are mainly useful for beamer beginners, since they are the result of my first attempt at a LaTeX presentation.