Michigan State University

COVID-19 Social Media Datasets: Home

Introduction

This is a collection of COVID-19 datasets from social media, drawn mainly from major platforms.

A brief description of each dataset is included, and more details can be found by clicking the links, including methods of data collection and licensing information for reuse. 

Datasets

Multilingual and comprehensive datasets

COVID-19-TweetIDs

The Information Science Institute, University of Southern California

This is so far the most comprehensive and up-to-date Twitter dataset about COVID-19, starting from January 21, 2020. The dataset is updated daily. Each day has 24 separate text files, one per hour of data, except for January 21, for which only two hours of data are available. File names follow the pattern "coronavirus-tweet-id-2020-mm-dd-hour", which lets users break down Twitter discussions by hour.
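The file naming pattern above can be used to sort or group the hourly files by date. The following is a minimal sketch, not the dataset maintainers' own tooling. It assumes two-digit month, day, and hour fields (hours 00 to 23), accepts any four-digit year, and allows an optional ".txt" suffix; the function names `parse_hour_file` and `group_by_day` are hypothetical. Check these assumptions against the actual repository before relying on them.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern taken from the description: "coronavirus-tweet-id-2020-mm-dd-hour".
# Assumptions (not confirmed by the guide): two-digit month, day, and hour,
# any four-digit year, and an optional ".txt" suffix.
FILE_NAME_RE = re.compile(
    r"^coronavirus-tweet-id-(\d{4})-(\d{2})-(\d{2})-(\d{2})(?:\.txt)?$"
)


def parse_hour_file(name):
    """Return the datetime (date plus hour) encoded in a file name, or None."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, hour)
    except ValueError:
        # The name matched the pattern but encodes an impossible date or hour.
        return None


def group_by_day(names):
    """Map each calendar date to the sorted list of its hourly file names."""
    grouped = defaultdict(list)
    for name in names:
        stamp = parse_hour_file(name)
        if stamp is not None:
            grouped[stamp.date()].append(name)
    return {day: sorted(files) for day, files in grouped.items()}
```

Calling `group_by_day` on a directory listing returns one entry per day, so a day with fewer than 24 files (such as January 21, 2020) is easy to spot. Names that do not match the pattern are skipped rather than raising an error.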

A list of keywords used to collect the tweets can be found here. See here for examples of published papers using this dataset. 

A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration

Georgia State University

This is a multilingual collection of Twitter data (March 22, 2020 to present). Both the full dataset and a cleaned version (no retweets) are available. The most popular terms, bigrams, and trigrams are also provided for NLP tasks in the GitHub repository.

GeoCoV19: a dataset of hundreds of millions of multilingual COVID-19 Tweets with Location Information

This large-scale multilingual Twitter dataset captures tweets about COVID-19 with geolocation information, extracted from the tweets’ location information and tweet content. It covers tweets from February 1, 2020 to May 1, 2020, spanning 218 countries, 47K cities, and 62 languages.

An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic

This is a comprehensive dataset for researchers to explore popular discourses about the COVID-19 pandemic. It contains daily tweet files (each file holds one hour of tweets), starting from January 22, 2020. In addition to the tweet files, it provides summary details, sentiment, summary hashtag, summary mentions, summary NER (and daily tops), and a features table. It is well suited as a training set for the study of “(mis)information diffusion, semantic networks, sentiment, and the evolution of COVID-19 discussions”.

Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations

Johns Hopkins University

This dataset contains COVID-related tweets with date, keywords, and inferred geolocation included. The daily data files are in JSON format.

TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic

“TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies.” The latest version is updated to April 2020.

COVID-19 misinformation datasets

Dataset for COVID-19 misinformation on Twitter

This repository accompanies the research article "An Exploratory Study of COVID-19 Misinformation on Twitter" and contains two datasets. According to the authors, “the first dataset are the tweets which have been mentioned by fact-checking websites and are classified as false or partially false and the second dataset consists of COVID-19 tweets collected from publicly available corpus TweetsCOV19 (January-April 2020) and in-house crawling from May-July 2020.”

CoAID

“CoAID (Covid-19 heAlthcare mIsinformation Dataset) is a diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement about such news. It includes 1,896 news, 183,564 related user engagements, 516 social platform posts about COVID-19, and ground truth labels.”

CMU-MisCov19: A Novel Twitter Dataset for Characterizing COVID-19 Misinformation

The dataset contains 4,573 annotated tweets across 17 themes related to COVID-19. The authors' annotation codebook is also provided. This annotated dataset can be used for studies of misinformation detection and characterization.

Weibo COVID Dataset

This dataset was collected from Sina Weibo, a Chinese microblogging platform similar to Twitter. It was used to analyze COVID-related misinformation and covers December 7, 2019 to April 4, 2020.

Social, Cultural and COVID Narratives

Digital narratives of COVID-19: a Twitter Dataset

University of Miami

This dataset collects bilingual (English and Spanish) Twitter data and is unique in its geographic focus. It is a digital humanities project covering Southern Florida and Latin American countries, starting from late April 2020.

Ethical use of social media data: resources

Twitter’s Terms of Service
