An elephant in the room: Twitter sampling methodology.


Statistical learning
Saturday 24th February 2018
Calissano, A.; Vantini, S.; Arnaboldi, M.
The use of social media data is spreading across the broad scientific community: 30,000 papers dealing with this type of data have been indexed in Scopus in the last decade. On the one hand, these data are very appealing, offering a rich pool of information. On the other hand, gathering them through a repeatable sampling strategy is becoming increasingly complex (or maybe even impossible?). The aim of this paper is to map the scientific community's awareness of the sampling strategies used to download online data, focusing on the most studied social media platform: Twitter. This review unveils two unexpected results: the downloaded data are typically far from being randomly sampled, and around 99% of papers do not explicitly declare the sampling strategy used to download the data. These two facts raise worrisome doubts about the trustworthiness of all the results presented in this stream of literature.