I am currently based at the Computer Science Department of University College London. My primary research focus is the analysis of textual user-generated content published on the Web (social media, search query logs, etc.) and of the user behaviour attached to it. I am also interested in interdisciplinary research that brings together Computer Science, Healthcare, Statistics and the Social Sciences.
Recent news snippets
- [05/12/2015] Paper that proposes a method for inferring the socioeconomic status of social media users was accepted at ECIR 2016
- [22/09/2015] Paper that proposes a method for estimating user income levels from social media content and conducts a qualitative analysis of the outputs published in PLoS ONE
- [03/08/2015] Paper that significantly improves influenza-like illness prevalence modelling from search query logs published in Nature Scientific Reports [blog post]
- [02/07/2015] Paper that proposes a method for estimating the impact of a health intervention via user-generated web data published at Data Mining and Knowledge Discovery [blog post]
- [01/06/2015] $50,000 sponsorship by Google to advance research on digital disease surveillance
- [28/04/2015] Paper that presents a method for inferring the occupational class of a user based on multi-faceted Twitter activity accepted at ACL '15 [blog post]
- [2014/2015] Invited talks at the universities of Cambridge and Warwick [slides]
Research highlights Machine Learning; Natural Language Processing; User-Generated Content; Social Media; Big Data
In a series of papers, we show that social media content is predictive of the occupational class, income level and socioeconomic status of social media users. We additionally provide a qualitative analysis of the underlying patterns. Apart from the obvious commercial applications for these types of inference, they could also facilitate numerous downstream interdisciplinary research tasks. We have also made an effort to share the data sets used in our experiments (ACL '15 data set, PLoS ONE data set, ECIR '16 data set). Our work has been extensively covered by the media (e.g. The Washington Post, The Telegraph, Fortune).
Previous attempts to model influenza from the time series of search query frequencies (such as Google Flu Trends) produced high error rates during recent flu seasons. In this paper, we revisit the modelling of influenza rates, proposing a novel nonlinear approach based on a composite Gaussian Process that operates on top of search query clusters. We also extend the proposed query-only models with an autoregressive component that incorporates our most recent knowledge about flu prevalence in the population. Our rigorous experimental analysis spans 10 US flu seasons, reveals the pitfalls of previous approaches and provides a qualitative perspective on this research task. See my blog post for a brief summary.
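As an illustration of the core modelling idea (not the paper's exact implementation), the sketch below builds a Gaussian Process regressor whose covariance is a sum of squared-exponential kernels, one per query cluster. The cluster index sets, length scale and noise level are hypothetical placeholders.

```python
import numpy as np

def rbf(Xa, Xb, ls=1.0):
    # squared-exponential kernel computed over the supplied columns only
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls ** 2))

def composite_kernel(Xa, Xb, clusters, ls=1.0):
    # composite covariance: one RBF kernel per query cluster, summed
    return sum(rbf(Xa[:, idx], Xb[:, idx], ls) for idx in clusters)

def gp_posterior_mean(X_train, y_train, X_test, clusters, noise=1e-2):
    # standard GP regression posterior mean with the composite kernel
    K = composite_kernel(X_train, X_train, clusters) + noise * np.eye(len(X_train))
    K_star = composite_kernel(X_test, X_train, clusters)
    return K_star @ np.linalg.solve(K, y_train)
```

Summing per-cluster kernels lets each cluster of related queries contribute its own smooth component to the flu-rate estimate, rather than forcing a single kernel over all query frequencies at once.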
We introduce a framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model (composite Gaussian Process kernel) for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that historically correlate with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that would have emerged in the absence of the campaign. Our case study focuses on the influenza vaccination programme that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter.
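The counterfactual step can be sketched in a simplified form: fit the target area's disease rates against the control locations on the pre-intervention period, project that relationship forward, and take the difference from what was actually observed. The paper uses a composite Gaussian Process model; this sketch substitutes ordinary least squares for brevity, and all variable names are illustrative.

```python
import numpy as np

def intervention_impact(control, target, t0):
    """Estimate intervention impact in a target area.

    control: (T, C) disease-rate series for the control locations
    target:  (T,)  series for the area where the campaign ran
    t0:      time index at which the intervention starts
    """
    # fit target ~ controls (plus intercept) on the pre-intervention period only
    A = np.c_[control[:t0], np.ones(t0)]
    w, *_ = np.linalg.lstsq(A, target[:t0], rcond=None)
    # project the counterfactual (no-campaign) rates for the later period
    B = np.c_[control[t0:], np.ones(len(target) - t0)]
    counterfactual = B @ w
    # negative values suggest the campaign reduced disease rates
    return target[t0:] - counterfactual
```

The control locations act as a synthetic baseline: because they correlate with the target area historically but received no campaign, their post-intervention rates carry the seasonal signal that the target would have followed without it.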
Instead of just modelling word frequencies (or, more generally, word characteristics) by learning a weight vector w, we also learn a weight for each user (u). Thus, the linear regression model becomes bilinear. This idea was applied to predicting voting intention polls from tweets in two countries (Austria and the UK), but it is also applicable to various other NLP tasks, such as the extraction of socioeconomic patterns from the news. You may download a beta version of Bilinear Elastic Net (BEN) for MATLAB; relevant slides are available.
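A minimal sketch of the bilinear idea: the response at each time step is modelled as u'X w, where X holds user-by-word frequencies, and u and w are learned by alternating between two convex subproblems. Note the sketch uses plain ridge updates in place of BEN's elastic-net regularisation, and the dimensions and regularisation strength are illustrative.

```python
import numpy as np

def bilinear_fit(Xs, y, n_iter=100, lam=1e-6):
    """Alternating ridge for a bilinear model y_t ≈ u' X_t w.

    Xs: (T, U, W) user-by-word frequency matrices, one per time step
    y:  (T,) target values (e.g. voting intention polls)
    """
    T, U, W = Xs.shape
    u = np.ones(U) / U  # uniform user weights to start
    w = np.ones(W) / W
    for _ in range(n_iter):
        # fix u: the model is linear in w with features u' X_t
        Fw = np.einsum('u,tuw->tw', u, Xs)
        w = np.linalg.solve(Fw.T @ Fw + lam * np.eye(W), Fw.T @ y)
        # fix w: the model is linear in u with features X_t w
        Fu = np.einsum('w,tuw->tu', w, Xs)
        u = np.linalg.solve(Fu.T @ Fu + lam * np.eye(U), Fu.T @ y)
    return u, w
```

Each alternating step solves an ordinary regularised linear regression, which is what makes the bilinear formulation tractable despite being jointly non-convex in (u, w).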
What are the most important factors for determining user impact on Social Media platforms? Can we identify user actions that have a significant effect on their impact? In this work, we propose a set of nonlinear models based on Gaussian Processes for inferring user impact on Twitter. Our modelling is based on actions under the direct control of a user, including textual features such as word or topic (word-cluster) frequencies. Given the strong inference performance, we then dig further into our models and qualitatively analyse their properties from a variety of angles, in an effort to discover the specific user behaviours that are decisive for impact. A brief summary of this work is given in this blog post.
What happens if we quantify affective expression in millions of books? We can probably identify periods with dominant emotions, extract temporal emotion patterns throughout the century and come up with interesting scenarios that may explain them (PLoS ONE, 2013). Additionally, we could explain those patterns by looking at their reflection in real-world tendencies, such as indices of the economy, the main driving factor of the system we are living in (PLoS ONE, 2014).
An effort to assess the statistical robustness of the above findings, together with comparative figures across different emotion detection tools, is presented in this paper (Big Data '13).
Press releases: University of Bristol (1), University of Bristol (2), University of Sheffield
Media coverage: Nature, The Guardian, New York Times (1), New York Times (2), Slate, BBC Radio 4, Die Welt
Can we apply Machine Learning methods to text generated by Social Media (e.g. Twitter) users in order to quantify the magnitude of events, such as an infectious disease outbreak (e.g. flu) or even rainfall?
Press release: University of Bristol
Media coverage: ScienceDaily, Natural Hazards Observer, BBC Radio 4
Distinctions: EPSRC research highlight (2011), Most notable computing publications (2012)
This is the first work showing that Social Media can be used to track the level of an infectious disease, such as influenza-like illness (ILI), in the population. To achieve that, we collected geolocated tweets from 54 UK cities and used them in a regularised regression model that was trained and evaluated against ILI rates from the Health Protection Agency. Flu Detector is a demonstration (now stopped) that used the content of Twitter to nowcast the level of flu-like illness in several UK regions on a daily basis. We have recently come up with an improved visualisation of predicted flu rates from Twitter data, but it is still in its alpha version.
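A toy sketch of the regression step on fabricated numbers, using scikit-learn's Lasso as a stand-in for the regularised regression in the paper. Every value below is synthetic; the point is only how sparse regression picks out the few textual markers whose frequencies track the official ILI signal.

```python
import numpy as np
from sklearn.linear_model import Lasso

# rows: days; columns: frequencies of candidate flu-related textual
# markers (n-grams) in geolocated tweets -- all values fabricated here
rng = np.random.default_rng(0)
X = rng.random((120, 40))
w_true = np.zeros(40)
w_true[:5] = [2.0, 1.5, 1.0, 0.8, 0.5]  # only 5 markers actually matter
ili = X @ w_true + 0.01 * rng.standard_normal(120)  # stand-in ILI rates

# L1 regularisation drives the weights of uninformative markers to zero
model = Lasso(alpha=0.005).fit(X, ili)
selected = np.flatnonzero(model.coef_)  # the sparse set of kept markers
```

Sparsity matters in this setting because the candidate marker list is large and mostly noise; the regulariser keeps the model interpretable and robust when scoring unseen days of tweets.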
Press release: Computer Science Department, University of Bristol
Media coverage: MIT Technology Review, New Scientist
Mood of the Nation (now stopped) used more than half a million geolocated tweets on a daily basis to detect mood and affect trends in the UK population, focusing on four categories of affect: joy, sadness, anger and fear. A simple assessment of those patterns reveals quite interesting results. Check this out for example!
Press release: University of Bristol
Media coverage: Mashable, New Scientist, Dradio, BBC World News