This is a collection of COVID-19 datasets generated from major social media platforms.
A brief description of each dataset is included, and more details can be found by clicking the links, including methods of data collection and licensing information for reuse.
Multilingual and comprehensive datasets
The Information Science Institute, University of Southern California
This is so far the most comprehensive and up-to-date Twitter dataset about COVID-19, starting from January 21, 2020. The dataset is updated daily. For each day, there are 24 separate text files, each holding one hour of data, except for January 21, for which only two hours of data are available. File names follow the pattern "coronavirus-tweet-id-2020-mm-dd-hour", which lets users break down Twitter discussions by hour.
A list of keywords used to collect the tweets can be found here. See here for examples of published papers using this dataset.
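The hourly file naming scheme described above lends itself to simple scripting. The following is a minimal sketch, assuming file names match "coronavirus-tweet-id-yyyy-mm-dd-hour" with a two-digit hour and an optional ".txt" extension (the extension and the helper names parse_hour and group_by_day are assumptions for illustration, not part of the dataset's documentation). It groups a list of file names by day, so that gaps such as the partial coverage on January 21 are easy to spot.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed pattern based on the naming scheme described in the text.
# The optional ".txt" extension is an assumption.
FILENAME_RE = re.compile(
    r"^coronavirus-tweet-id-(\d{4})-(\d{2})-(\d{2})-(\d{2})(?:\.txt)?$"
)


def parse_hour(filename):
    """Return the datetime (to the hour) encoded in a file name, or None."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, hour)
    except ValueError:
        # Not a valid calendar date or hour (for example, hour 24).
        return None


def group_by_day(filenames):
    """Map each date to the sorted list of hours that have a file."""
    hours_by_day = defaultdict(list)
    for name in filenames:
        stamp = parse_hour(name)
        if stamp is not None:
            hours_by_day[stamp.date()].append(stamp.hour)
    return {day: sorted(hours) for day, hours in hours_by_day.items()}
```

A day with fewer than 24 entries in the result of group_by_day has missing hourly files in the local copy.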
A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
Georgia State University
This is a multilingual collection of Twitter data (March 22, 2020 to present). Both the full dataset and a cleaned version (with retweets removed) are available. The most frequent terms, bigrams, and trigrams are also provided for NLP tasks in the project's GitHub repository.
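To illustrate what the cleaned version and the bigram and trigram lists represent, here is a minimal sketch of counting n-grams over tweet text after dropping retweets. It is not the dataset authors' pipeline: the rule that a retweet starts with "RT @", the whitespace tokenizer, and the function names are all assumptions made for this example.

```python
from collections import Counter


def is_retweet(text):
    # Assumption: retweets are identified by a leading "RT @" marker.
    return text.startswith("RT @")


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tweets, n, k=10):
    """Count n-grams across non-retweet texts and return the k most common."""
    counts = Counter()
    for text in tweets:
        if is_retweet(text):
            continue
        # Naive lowercase whitespace tokenization, for illustration only.
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)
```

Calling top_ngrams(tweets, 2) yields bigram counts and top_ngrams(tweets, 3) yields trigram counts. A real analysis would likely need language-aware tokenization, given that the collection is multilingual.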
GeoCoV19: a dataset of hundreds of millions of multilingual COVID-19 Tweets with Location Information
This large-scale multilingual Twitter dataset captures tweets about COVID-19 with geolocation information, extracted from the tweets’ location information and tweet content. It covers tweets from February 1, 2020 to May 1, 2020, spanning 218 countries, 47K cities worldwide, and 62 languages.
An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic
This is a comprehensive dataset for researchers to explore popular discourses about the COVID-19 pandemic. It contains daily tweet files (each file holds one hour of tweets), starting from January 22, 2020. In addition to the tweet files, it provides summary details, sentiment, summary hashtag, summary mentions, summary NER (and daily tops), and a features table. It is well suited as a training set for the study of “(mis)information diffusion, semantic networks, sentiment, and the evolution of COVID-19 discussions”.
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations
Johns Hopkins University
This dataset contains COVID-related tweets with date, keywords, and inferred geolocation included. The daily data files are in JSON format.
TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic
“TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies.” The latest version is updated through April 2020.
COVID-19 misinformation datasets
Dataset for COVID-19 misinformation on Twitter
This dataset was used for a research article titled “An Exploratory Study of COVID-19 Misinformation on Twitter”. There are two datasets in this repository. According to the authors, “the first dataset are the tweets which have been mentioned by fact-checking websites and are classified as false or partially false and the second dataset consists of COVID-19 tweets collected from publicly available corpus TweetsCOV19 (January-April 2020) and in-house crawling from May-July 2020.”
CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement about such news. It includes 1,896 news items, 183,564 related user engagements, 516 social platform posts about COVID-19, and ground truth labels.
CMU-MisCov19: A Novel Twitter Dataset for Characterizing COVID-19 Misinformation
The dataset contains 4,573 annotated tweets across 17 themes related to COVID-19. The annotation codebook is also provided. This annotated dataset can be used for studies of misinformation detection and characterization.
This dataset was collected from Sina Weibo, a Chinese microblogging platform similar to Twitter. It was used to analyze COVID-related misinformation and covers December 7, 2019 to April 4, 2020.
Social, Cultural and COVID Narratives
Digital narratives of COVID-19: a Twitter Dataset
University of Miami
This dataset collects bilingual (English and Spanish) Twitter data and is unique in its geographic focus. It is a digital humanities project covering South Florida and Latin American countries, starting from late April 2020.
Twitter’s Terms of Service