Michigan State University

STT 200 Statistical Methods: Textmining

Data analysis, probability models, random variables, estimation, tests of hypotheses, confidence intervals, and simple linear regression.

Text Mining

The discovery by computer of new, previously unknown information by automatically extracting it from a usually large number of different unstructured textual resources.
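
As a rough illustration of "automatically extracting information" from unstructured text, the sketch below counts how often each word appears in a short passage. It is only a minimal example, not a recommended workflow: the `term_frequencies` function, the `STOP_WORDS` set, and the sample sentence are all hypothetical names and data made up for this guide, and real projects usually rely on a fuller stop-word list and a dedicated text-processing library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "by"}

def term_frequencies(text):
    """Lowercase the text, split it into word tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Made-up sample sentence standing in for a real document.
sample = "Text mining is the discovery of information by mining text."
freqs = term_frequencies(sample)

# Show the most frequent remaining terms with their counts.
for term, count in freqs.most_common(3):
    print(term, count)
```

The same idea scales to any of the collections listed below: read each document as a string, pass it through a function like this one, and compare the resulting counts across documents.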

Freely Available Resources

Open Data Census
- A community effort, the OCDF data census seeks to aggregate public datasets related to social media and online communities.

SNAP: Stanford Network Analysis Project
- Stanford Network Analysis Platform (SNAP) is a general-purpose network analysis and graph mining library.

Awesome Public Datasets
- An awesome list of high-quality open datasets (HQOD) in public domains (on-going).

Common Crawl
- Open repository of web crawl data that can be accessed and analyzed by anyone.

the @unitedstates project
- @unitedstates is a shared commons of data and tools for the United States. Made by the public, used by the public.

American Presidency Project
- One of the most comprehensive collections of web resources on the American presidency, including documents, public papers, executive orders, addresses, press conferences, debates, election data, approval ratings, and much more.

University of Oxford Text Archive
- The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning.

Hathi Trust
- Non-Google digitized collection: Approximately 550,000 public domain volumes as of March 2015, primarily, though not exclusively, English language materials published prior to 1923.

JSTOR for Research
- Data for Research is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives.

PubMed Central Open Access Subset

Patents & Trademarks (Google Bulk Downloads)