Skip to content Skip to navigation

You can find an improved version of this post over there.

This blog post draws a summary on three recent research papers that propose statistical Natural Language Processing frameworks for inferring socioeconomic attributes from social media (Twitter) user profiles. The attributes we have focused on are a user's (a) occupational class (Preotiuc-Pietro, Lampos and Aletras; ACL 2015), (b) income (Preotiuc-Pietro et al. 2015; PLOS ONE, 2015), and (c) socioeconomic status (Lampos et al. 2016; ECIR 2016).

Driven by a quite mature hypothesis

Studies in sociology have deducted that social status influences facets of language (see Bernstein, 1960 or Labov, 1966). Different socioeconomic backgrounds may result in distinctive topics of discussion or even specific (regional) dialects. Taking this notion a notch further, we hypothesise that language in social media may also be indicative of a user's socioeconomic profile. For example, when it comes to posting on Twitter, we expect that middle-aged users with senior managerial roles will be (on average) more formal and less open than younger users who are less established professionally and accumulate a lower income. And if this is true, then we should be able to capture this relationship and derive a statistical map from a user's text and online behaviour to a perceived socioeconomic profile.

The automatic inference of user demographics is useful. Why?

I can infer that some of you have second thoughts: Is this type of modelling useful? Does it violate user privacy? So, before going further into our research, I would like to give my perspective on this. The mainstream answer applies here as well: it depends on how a research development is going to be utilised. The good side of things includes that these methods (a) can provide dynamic, timely and low-cost demographical information complementing the traditional time-consuming and expensive approaches, (b) can be used to support large-scale (computational) social science findings, and (c) can be used to enhance various other tasks that focus on particular stratifications of the population, e.g. health surveillance or social services. Of course, there is also a number of commercial downstream applications that can stem out of this, but this not the main driver of my research. Finally, mean applications may arise, but (a) such tendencies will not be stopped by blocking this line of research, and (b) it is really up to our societies to safeguard user rights in those occasions.

Using traditional nation-wide statistical schemes for formulating our inference tasks

To standardise the explored inference tasks we have used occupational groups, salary bands and socioeconomic status mappings proposed by the Office for National Statistics (ONS) in the United Kingdom (UK). At the centre of all tasks stands the Standard Occupational Classification (SOC) taxonomy. The SOC taxonomy is a hierarchical structure that starts with 9 major occupation classes, denoted by a single digit (1 to 9), and then breaks down to 25 sub-major groups (denoted by 2 digits), 90 minor groups (denoted by 3 digits), and finally 369 unit groups (denoted by 4 digits). Occupations in upper classes require higher levels of education (i.e. a university degree or a Ph.D.), whereas occupations in the lower classes refer to a more elementary skill set. A snapshot of SOC is provided below (left). For the occupational class inference task, we create a labelled 9-class user data set by first mapping (manually) a user profile to a unit 4-digit job group, and then following up the SOC hierarchy to tag it with the corresponding major 1-digit group. So, for each user in our data set we obtain a major occupational class label. For the income inference task, we use ONS' Annual Survey of Hours and Earnings to map a minor (3-digit) occupational group to the mean yearly income for 2013 in British pounds (GBP). Below (right) you will find a corresponding snapshot for this mapping. Finally, for the socioeconomic status inference task, we use a mapping from a unit (4-digit) group to a simplified socioeconomic class: upper, middle or lower. This mapping is encoded in another ONS tool, titled as the National Statistics SocioEconomic Classification (NS-SEC). Note that the main laborious (manual) part in this process is tagging a user profile with a 4-digit job group. However, in order to create a large-scale data set for model training and experimentation one can potentially crowdsource this step.


This blog post introduces a collaborative work with Andrew Miller from Harvard University as well as Christian Stefansen and Steve Crossan from Google, where we propose a set of improvements to models for nowcasting influenza rates from search engine query logs. The outcomes of this research have been published in Nature Scientific Reports.

You probably know of Google Flu Trends (GFT). It was a platform that displayed weekly estimates of influenza-like illness (ILI) rates all around the world. These estimates were products of a statistical model that mapped the frequency of several search queries to official health surveillance reports. GFT was an important tool because it constituted a complementary indicator to traditional epidemiological schemes, one though that is characterised by a better penetration in the population. Furthermore, GFT's estimates were shown to be more timely, and, in general, such methods can potentially be used in locations with less advanced or even nonexistent healthcare systems.

The original GFT model (10.1038/nature07634) is expanding on ideas presented in 10.1086/593098 or 10.2196/jmir.1157. It is a simple method applied, however, on a massive volume of data. The algorithm has two phases: a) the selection of a small subset of search queries using a more than usual involved correlation analysis, and b) the training of a basic (least squares) regression model that predicts ILI rates using the selected queries. More specifically, the GFT model is proposing that $$\text{ILI } = qw + \beta \, ,$$ where $q$ represents the aggregate frequency of the selected queries for a week, and $w$, $\beta$ are the regression weight and intercept parameters respectively that are learned during model training.1

This algorithm was definitely a good start, but not good enough. Several publications showcased some of the minor or major errors in ILI estimates made during its application (10.1371/journal.pone.0023610, 10.1371/journal.pcbi.1003256, 10.1126/science.1248506).

What can be improved?

In a paper published today, we make an effort to understand more rigorously the possible pitfalls and performance limitations of the original GFT algorithm as well as propose improvements.2 Starting from the query selection method, we no longer separate it from the regression modelling. To do so, we apply a regularised regression technique known as the Elastic Net (10.1111/j.1467-9868.2005.00503.x). This enables more expressiveness, since different queries can have different weights, and simultaneously performs query selection by encouraging sparsity (zero weights). The regression model now becomes $$\text{ILI} = \sum_{i=1}^{n} q_i w_i + \beta \, ,$$ where $q_i$ denotes a query frequency and $w_i$ its respective weight (many of which can be equal to 0).3

Looking at the relationship between the time series of ILI rates and search query frequencies, we realised that nonlinearities may be present. See below, for example, what we got for the query sore throat remedies (denoted by $x_q$) with or without a logit space transformation.4

The present nonlinearities together with a desired property of grouping related queries that may reflect different flu or disease related concepts led to the proposition of a nonlinear Gaussian Process (GP) regression model applied on top of query clusters. The queries entering the GP model were the ones pre-selected by the Elastic Net. A different squared exponential GP kernel (or covariance function) was applied on top of each query cluster (check the paper for more details).

Below, you can see the estimated ILI rates in the US for several years by the investigated regression models (GFT, Elastic Net, GP) compared to weekly ILI reports published by the CDC. Evidently, the GP model yields the best estimates.

To understand the numerical contribution better, the mean absolute percentage of error across 5 flu seasons (2008-2013) for the GFT, Elastic Net and GP models was equal to 20.4%, 11.9% and 10.8% respectively. When it really mattered, i.e. during the high flu circulation weeks in those 5 seasons, the respective error rates were equal to 24.8%, 15.8%, 11%, indicating an error increase for the linear GFT and Elastic Net models, but also a performance stability for the nonlinear GP model.

Can user-generated Internet content be used to assess the impact of a health intervention? In a new paper published in Data Mining and Knowledge Discovery, we propose a method for estimating the impact of a vaccination program for influenza based on social media content (Twitter) and search query data (Bing). The work has been done in collaboration with Public Health England and Microsoft Research, was funded by the interdisciplinary project i-sense and will be presented at the journal track of ECML PKDD 2015 in September.

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas, where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.


Vasileios Lampos, Elad Yom-Tov, Richard Pebody and Ingemar J. Cox. Assessing the impact of a health intervention via user-generated Internet content. Data Mining and Knowledge Discovery 29(5), pp. 1434-1457, 2015. doi: 10.1007/s10618-015-0427-9
Paper | Supplementary Material

In our ACL '15 paper — co-authored with Daniel Preoţiuc-Pietro and Nikolaos Aletras — "An analysis of the user occupational class through Twitter content," we explore the dynamics of social media information in the task of inferring the occupational class of users. We base our analysis on the Standard Occupational Classification from the Office of National Statistics in the UK, which encloses 9 extensive categories of occupations.

The investigated methods take advantage of the user's textual input as well as platform-oriented characteristics (interaction, impact, usage). The best performing methodology uses a neural clustering technique (spectral clustering on neural word embeddings) and a Gaussian Process model for conducting the classification. It delivers a 52.7% accuracy in predicting the user's occupational class, a very decent performance for a 9-way classification task.

Our qualitative analysis confirms the generic hypothesis of occupational class separation as indicated by the language usage for the different job categories. This can be due to a different topical focus, e.g. artists will talk about art, but also due to more generic behaviours, e.g. the lower-ranked occupational classes tend to use more elongated words, whereas higher-ranked occupations tend to discuss more about Politics or Education.

We are also making the data of the paper available (README).

D. Preoţiuc-Pietro, V. Lampos and N. Aletras. An analysis of the user occupational class through Twitter content. ACL '15, pp. 1754-1764, 2015.


Subscribe to RSS - blogs