This blog post summarises three recent research papers that propose statistical Natural Language Processing frameworks for inferring socioeconomic attributes from social media (Twitter) user profiles. The attributes we have focused on are a user's (a) occupational class (Preotiuc-Pietro, Lampos and Aletras; ACL 2015), (b) income (Preotiuc-Pietro et al.; PLOS ONE 2015), and (c) socioeconomic status (Lampos et al.; ECIR 2016).
Studies in sociology have deduced that social status influences facets of language (see Bernstein, 1960 or Labov, 1966). Different socioeconomic backgrounds may result in distinctive topics of discussion or even specific (regional) dialects. Taking this notion a notch further, we hypothesise that language in social media may also be indicative of a user's socioeconomic profile. For example, when it comes to posting on Twitter, we expect that middle-aged users with senior managerial roles will be (on average) more formal and less open than younger users who are less established professionally and earn a lower income. If this is true, then we should be able to capture this relationship and derive a statistical mapping from a user's text and online behaviour to a perceived socioeconomic profile.
I suspect that some of you have reservations: Is this type of modelling useful? Does it violate user privacy? So, before going further into our research, I would like to give my perspective on this. The usual answer applies here as well: it depends on how a research development is going to be utilised. On the positive side, these methods (a) can provide dynamic, timely and low-cost demographic information that complements the traditional, time-consuming and expensive approaches, (b) can be used to support large-scale (computational) social science findings, and (c) can be used to enhance various other tasks that focus on particular stratifications of the population, e.g. health surveillance or social services. Of course, there are also a number of commercial downstream applications that can stem from this, but this is not the main driver of my research. Finally, malicious applications may arise, but (a) such tendencies will not be stopped by blocking this line of research, and (b) it is really up to our societies to safeguard user rights in such cases.
To standardise the explored inference tasks we have used the occupational groups, salary bands and socioeconomic status mappings proposed by the Office for National Statistics (ONS) in the United Kingdom (UK). At the centre of all tasks stands the Standard Occupational Classification (SOC) taxonomy. SOC is a hierarchical structure that starts with 9 major occupation classes, denoted by a single digit (1 to 9), and then breaks down into 25 sub-major groups (2 digits), 90 minor groups (3 digits), and finally 369 unit groups (4 digits). Occupations in the upper classes require higher levels of education (e.g. a university degree or a Ph.D.), whereas occupations in the lower classes call for a more elementary skill set. A snapshot of SOC is provided below (left). For the occupational class inference task, we create a labelled 9-class user data set by first (manually) mapping a user profile to a 4-digit unit group, and then following the SOC hierarchy upwards to tag it with the corresponding 1-digit major group. So, for each user in our data set we obtain a major occupational class label. For the income inference task, we use the ONS Annual Survey of Hours and Earnings to map a minor (3-digit) occupational group to its mean yearly income for 2013 in British pounds (GBP). Below (right) you will find a corresponding snapshot of this mapping. Finally, for the socioeconomic status inference task, we use a mapping from a unit (4-digit) group to a simplified socioeconomic class: upper, middle or lower. This mapping is encoded in another ONS tool, the National Statistics Socio-economic Classification (NS-SEC). Note that the main laborious (manual) part in this process is tagging a user profile with a 4-digit job group. However, in order to create a large-scale data set for model training and experimentation, one could potentially crowdsource this step.
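To make the labelling pipeline concrete, here is a minimal Python sketch of how the three labels could be derived from a single 4-digit unit code. It relies only on the fact that a SOC code's prefixes identify its ancestors in the hierarchy. The two lookup tables, their codes and their values are hypothetical placeholders (the real ones come from the ONS earnings survey and NS-SEC), and the function names are mine, not from the papers.

```python
def soc_levels(unit_code: str) -> dict:
    """Split a 4-digit SOC unit code into its ancestors in the hierarchy."""
    if len(unit_code) != 4 or not unit_code.isdigit() or unit_code[0] == "0":
        raise ValueError("expected a 4-digit SOC unit code starting with 1-9")
    return {
        "major": unit_code[:1],      # 1 digit, 9 classes
        "sub_major": unit_code[:2],  # 2 digits
        "minor": unit_code[:3],      # 3 digits
        "unit": unit_code,           # 4 digits
    }


# Hypothetical placeholder tables: the codes and values are illustrative only.
MEAN_INCOME_BY_MINOR = {"211": 40000.0, "923": 15000.0}  # GBP per year
SES_BY_UNIT = {"2111": "upper", "9233": "lower"}


def label_user(unit_code: str) -> dict:
    """Derive the three task labels from a manually assigned unit code."""
    levels = soc_levels(unit_code)
    return {
        "occupational_class": int(levels["major"]),
        "income": MEAN_INCOME_BY_MINOR.get(levels["minor"]),  # None if unknown
        "ses": SES_BY_UNIT.get(unit_code),                    # None if unknown
    }
```

Only the first step (assigning the unit code) needs human judgement; everything after it is a table lookup.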
A social media user creates a number of trails: textual information in posts, platform behaviour, perceived impact, profile information and so on. In our models, we have tried to incorporate all these features. We have also investigated the contribution of more advanced, but at the same time more approximate, user characteristics; we refer to them as perceived (as we inferred them separately from the user data) psycho-demographics. These include gender, age, political orientation, relationship, religion, education, as well as sentiment and emotions.
In all our experimental setups, the topics of discussion consistently provided the best statistical traction. To form clusters of keywords based on Twitter content we compiled and utilised a held-out corpus containing millions of tweets. We first computed a word-by-word similarity matrix, using a tweet as the context in which two words co-occur. The applied similarity metric was the Normalised Pointwise Mutual Information (NPMI) introduced by Bouma (2009). Then, as we were interested in obtaining hard latent topic clusters, we applied spectral clustering on the word-by-word similarity matrix. In other words, we performed Singular Value Decomposition (SVD) on the graph Laplacian of the NPMI word-by-word matrix (which, after all, defines a weighted graph over words).
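As a rough illustration of the similarity step, the sketch below computes NPMI with a tweet as the co-occurrence context, i.e. NPMI(x, y) = ln[p(x, y) / (p(x) p(y))] / (-ln p(x, y)), where probabilities are fractions of tweets. The whitespace tokenisation, the value of -1 for pairs that never co-occur and the value of 1 for pairs that appear together in every tweet are simplifying assumptions of mine, not details taken from the papers.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np


def npmi_matrix(tweets):
    """Return (vocabulary, NPMI matrix) using each tweet as the context."""
    n_tweets = len(tweets)
    word_counts = Counter()
    pair_counts = Counter()
    for tweet in tweets:
        # Count each word (and each word pair) at most once per tweet.
        words = sorted(set(tweet.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    vocab = sorted(word_counts)
    index = {word: i for i, word in enumerate(vocab)}

    # Assumption: pairs that never co-occur get the limiting value of -1.
    sim = np.full((len(vocab), len(vocab)), -1.0)
    np.fill_diagonal(sim, 1.0)

    for (a, b), count in pair_counts.items():
        p_ab = count / n_tweets
        if count == n_tweets:
            value = 1.0  # both words in every tweet: treat as perfect association
        else:
            p_a = word_counts[a] / n_tweets
            p_b = word_counts[b] / n_tweets
            value = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        sim[index[a], index[b]] = sim[index[b], index[a]] = value
    return vocab, sim
```

NPMI ranges from -1 (never together) through 0 (independent) to 1 (always together), which makes it convenient as a bounded similarity.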
Recently, there has been a growing interest in neural language models, where the words are projected into a lower dimensional dense vector space via a hidden layer (see Mikolov et al., 2013a and Mikolov et al., 2013b). Therefore, we also used the skip-gram (word2vec) model with negative sampling (as implemented in the gensim library) to learn word embeddings on a held-out Twitter reference corpus. This time we replaced the NPMI metric with the cosine similarity between all pairs of word embeddings. As in the previous approach, we then applied spectral clustering on the derived word-by-word similarity matrix. In line with the recent trend, the neural clusters improved the inference performance further. A snapshot with the most relevant topics in predicting a user's income is given below (the last column holds a parameter that is inversely proportional to topic relevancy, i.e. the smaller the better). Talking about politics or Non-Governmental Organisations (NGOs), or even using swear words, are among the most income-predictive discussion themes.
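The clustering step can be sketched in plain NumPy as follows. This is not the code used in the papers: it assumes the symmetric normalised graph Laplacian, clips negative cosine similarities to zero to obtain a valid affinity matrix, and finishes with a small k-means using a deterministic farthest-point initialisation. It uses an eigendecomposition, which for a symmetric positive semi-definite Laplacian yields the same spectrum as an SVD. The embeddings would normally come from a trained word2vec model; here they are just a NumPy array.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def spectral_clusters(similarity, k, n_iter=50):
    """Hard-cluster items given a symmetric similarity matrix."""
    weights = np.clip(similarity, 0.0, None)  # assumption: drop negative edges
    np.fill_diagonal(weights, 0.0)
    degree = weights.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    laplacian = np.eye(len(weights)) - d_inv_sqrt[:, None] * weights * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, vectors = np.linalg.eigh(laplacian)
    points = vectors[:, :k]
    points = points / np.clip(np.linalg.norm(points, axis=1, keepdims=True), 1e-12, None)

    # k-means with farthest-point initialisation (deterministic).
    centres = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(points[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return labels
```

Each resulting cluster of words plays the role of a topic, and a user's topic features are then the share of their tokens that fall in each cluster.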
Gaussian Processes (GPs) provide a powerful, adaptive, nonparametric and nonlinear (some may also add Bayesian, but this is not always true) modelling framework. A Gaussian Process is defined by a mean function (on the input space) and a covariance function (or kernel) on pairs of points in the input space. Together, these two functions specify a collection of target variables, any finite number of which have a joint multivariate Gaussian distribution (Rasmussen and Williams, 2006). Throughout our experiments, we have seen that GPs were better able to capture the multimodal feature space we have been operating on, and at the same time provide a significant level of interpretability (e.g. by looking at the length-scale parameters in a covariance function) that other strong (nonlinear) learners do not. On top of this, GPs are very straightforward to try (MATLAB library, Python library) and, most importantly, are modular enough to host creative ideas (for example, valid GP kernels can be added and multiplied without losing their GP identity).
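For readers who have not used GPs before, the following NumPy sketch shows the core computation: a zero-mean GP regression posterior mean under a squared-exponential kernel with one length-scale per feature. The kernel choice, the fixed hyperparameters and the toy data are assumptions for illustration; in practice the hyperparameters are learned, and a large learned length-scale signals a feature that matters little. The `composite` kernel illustrates that sums and products of valid kernels are again valid kernels.

```python
import numpy as np


def se_ard_kernel(xa, xb, length_scales, variance=1.0):
    """Squared-exponential kernel with one length-scale per input dimension."""
    a = xa / length_scales
    b = xb / length_scales
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.clip(sq_dist, 0.0, None))


def linear_kernel(xa, xb):
    """Dot-product kernel."""
    return xa @ xb.T


def gp_posterior_mean(kernel, x_train, y_train, x_test, noise_variance=1e-2):
    """Posterior mean of a zero-mean GP with Gaussian observation noise."""
    k_train = kernel(x_train, x_train) + noise_variance * np.eye(len(x_train))
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    return kernel(x_test, x_train) @ alpha


# Toy data (illustrative only): a smooth one-dimensional function.
x = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(x[:, 0])
scales = np.array([1.0])


def composite(xa, xb):
    """A sum and a product of valid kernels is still a valid kernel."""
    se = se_ard_kernel(xa, xb, scales)
    return se + se * linear_kernel(xa, xb)


prediction = gp_posterior_mean(composite, x, y, x)
```

The full models in the papers also learn the kernel hyperparameters and, for classification, use a non-Gaussian likelihood; neither is shown here.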
Across all tasks we have obtained promising performance figures. The nonlinear GP models consistently performed better than widely used learners, such as Support Vector Machines (using the Radial Basis Function kernel) and regularised Logistic Regression. Briefly, for the 9-way classification task of the occupational class inference (see below on the left), we reached a 52.7% accuracy using the GP-based model on 200 neural topics. Note that (a) the SVM provided a 1% lower performance as well as a model that is hard to interpret, and (b) for this task non-textual user attributes did not have significant predictive power. For the income inference, I only show the figures for the best performing GP model across the various feature sets as well as for a combination of all features in a linear ensemble (see below on the right). The best Mean Absolute Error (MAE) is obtained when all features are combined (GBP 9,535), but it does not differ much from the one obtained when discussion topics were used alone (GBP 9,621).
Finally, in the 3-way socioeconomic status classification task (check the performance table on the right), we obtained an accuracy of 75.09%. When we converted this task into a binary classification by merging the users with a lower or middle socioeconomic status, the classification accuracy increased to 82.05%, as expected. For the record, the GP classifier that was used in these approaches combined all user feature categories by defining a covariance function for each one of them and then summing these covariance functions, i.e. performing the feature combination inside the GP instead of outside it (e.g. with a linear ensemble).
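The "combination inside the GP" idea can be sketched as one kernel per feature category, summed into a single covariance function. The category names, column ranges and squared-exponential form below are hypothetical choices for illustration, not the configuration reported in the paper.

```python
import numpy as np


def se_kernel(xa, xb, length_scale):
    """Isotropic squared-exponential kernel."""
    sq_dist = (xa ** 2).sum(axis=1)[:, None] + (xb ** 2).sum(axis=1)[None, :] - 2.0 * xa @ xb.T
    return np.exp(-0.5 * np.clip(sq_dist, 0.0, None) / length_scale ** 2)


# Hypothetical feature categories and the columns they occupy.
FEATURE_GROUPS = {"topics": slice(0, 3), "behaviour": slice(3, 5)}


def combined_kernel(xa, xb, length_scales):
    """Sum of one kernel per feature category."""
    total = np.zeros((len(xa), len(xb)))
    for name, columns in FEATURE_GROUPS.items():
        total += se_kernel(xa[:, columns], xb[:, columns], length_scales[name])
    return total


rng = np.random.default_rng(0)
users = rng.normal(size=(6, 5))  # toy feature matrix: 6 users, 5 features
gram = combined_kernel(users, users, {"topics": 1.0, "behaviour": 2.0})
```

Because each category has its own hyperparameters, the fitted model can weight and smooth each category differently, which a single kernel over all concatenated features cannot do.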
Discussion topics have been the strongest predictor of the investigated user demographics. Thus, it is no surprise that users with adjacent occupational classes have similar topic distributions. This is confirmed by measuring and visualising the Jensen-Shannon Divergence (the smaller the more similar, of course) between the topic distributions of all class pairs (see the figure on the right). Expected user class clusters emerge; I have circled some of them.
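For reference, the Jensen-Shannon Divergence between two topic distributions can be computed as below. This is a minimal sketch: the base-2 logarithm (which bounds the divergence by 1) and the toy inputs are my assumptions, and the papers' exact settings may differ.

```python
import numpy as np


def jensen_shannon_divergence(p, q, base=2.0):
    """JSD between two discrete distributions (inputs are normalised first)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_jsd(class_topic_distributions):
    """Symmetric matrix of JSD values between all pairs of classes."""
    n = len(class_topic_distributions)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = jensen_shannon_divergence(
                class_topic_distributions[i], class_topic_distributions[j]
            )
    return out
```

Unlike the Kullback-Leibler divergence, the JSD is symmetric and always finite, which is what makes a heat map over class pairs meaningful.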
Following up on topics, and focusing on those with the highest predictive relevance in the user occupation classification task, we visualise a topic's CDF across the users of the 9 occupational classes. A CDF indicates the fraction of users whose topic proportion in their tweets is at most a given value. Visually, a topic is more dominant in a class if the CDF line leans towards the bottom-right corner of the plot. In the figures below, you see that the topic of Higher Education (left figure) is more prevalent in classes 1 and 2, but also discriminates classes 3 and 4 from the rest. This is expected because the vast majority of jobs in these classes require a university degree or are actually jobs in higher education. By examining the topic of Arts (middle figure), we see that it clearly separates class 5 from all other classes. Class 5 is indeed the class that includes artistic professions. Finally, the topic of Elongated Words (a.k.a. Twitter slang) is more prevalent in the lower occupational classes.
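An empirical CDF of this kind takes only a couple of lines to compute. The sketch below uses the standard definition (the fraction of users whose topic proportion is less than or equal to each threshold); the function name and the toy numbers are illustrative.

```python
import numpy as np


def empirical_cdf(values, thresholds):
    """Fraction of `values` that are <= each entry of `thresholds`."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(ordered, thresholds, side="right") / len(ordered)


# Toy topic proportions for the users of two classes (illustrative only).
heavy_users = [0.30, 0.40, 0.50, 0.60]
light_users = [0.05, 0.10, 0.15, 0.20]
grid = np.array([0.1, 0.3, 0.6])

cdf_heavy = empirical_cdf(heavy_users, grid)
cdf_light = empirical_cdf(light_users, grid)
```

At every threshold the class that uses the topic more has the lower CDF value, which is why its curve sits closer to the bottom-right corner of the plot.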
Moving to income modelling, we looked at the relationship between the perceived user demographics and the corresponding perceived income (see the figure below). This was needed in order to validate that our data, especially the perceived demographic attributes, were capturing reasonable trends (even when representing the population of Twitter users). Indeed, our data confirmed that (our world is not fair) income (a) increases with age, (b) is higher for higher levels of education, and (c) is lower for females and African Americans on average.
We then looked at the relationship between the most influential topics in the inference of user income and income itself (see the figure below). Apart from the better performing nonlinear trend, we also plotted the linear one (obtained via regularised logistic regression) to showcase examples where the linear model is less flexible (and is potentially making a mistake) in capturing this relationship. We generally observe that users talk more about Politics, NGOs, and Corporate themes as their income gets higher. The opposite relationship holds for the use of swear words.
We also did the same analysis for sentiment and emotions vs. user income (see the figure below). Our analysis unveils that neutral sentiment increases with income, while both positive and negative sentiment decrease, i.e. lower income users are probably more subjective! In addition, the emotions of anger and fear are more present in users with higher income, while sadness, surprise and disgust are more associated with lower income.
The (aggregated) data sets used in the above research efforts have been made publicly available. Below you can find direct links to each one of them. Please refer to the "Data" sections of the corresponding papers to read their complete description.
D. Preotiuc-Pietro, V. Lampos and N. Aletras. An analysis of the user occupational class through Twitter content. ACL, 2015. [ data ]
D. Preotiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach and N. Aletras. Studying User Income through Language, Behaviour and Affect in Social Media. PLOS ONE, 2015. [ data ]
V. Lampos, N. Aletras, J. K. Geyti, B. Zou and I. J. Cox. Inferring the Socioeconomic Status of Social Media Users Based on Behaviour and Language. ECIR, 2016. [ data ]