The primary purpose of digital text collection is to provide faculty, students, and staff across all subject areas with access to text that is amenable to computational analysis. High-level modes of computational analysis include but are not limited to corpus linguistics methods, text mining, and data mining. Specific methods of analysis include but are not limited to sentiment analysis, named entity recognition, part-of-speech tagging, syntactic parsing, relationship extraction, network analysis, co-citation analysis, and topic modeling. Applying these methods to text supports discovery of patterns and relationships within and across texts, and enables testing of hypotheses at scale. Whether a text is amenable to computational analysis depends on its format, the availability of markup and/or metadata, and terms of use that permit the anticipated use.
Digital text is defined as single and combined characters (e.g., alphanumeric, logographic) organized into semantically meaningful form, encoded digitally, and stored on a medium accessible to a computational device. Digital text manifests at the level of an individual text, multiple texts, and extracts from texts, and extends to the text that describes text. Prior to the addition of markup, digital text is unstructured. Digital text is structured when markup, often a variant of Extensible Markup Language (XML), is used to explicitly assign values to components of the digital text. Both the digital text that is marked up and the markup (metadata) are suitable objects of collection activity.
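The distinction between unstructured and structured text can be illustrated with a minimal sketch. The sample sentence, the element names (loosely modeled on common text-encoding practice), and the variable names are assumptions chosen for illustration; they are not drawn from any particular collection or encoding standard adopted here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sentence with no markup: unstructured digital text.
unstructured = "Mary Shelley published Frankenstein in London in 1818."

# The same sentence with illustrative XML markup: structured digital text.
# Element and attribute names are assumptions made for this example.
structured = (
    "<s><persName>Mary Shelley</persName> published "
    "<title>Frankenstein</title> in <placeName>London</placeName> "
    'in <date when="1818">1818</date>.</s>'
)

root = ET.fromstring(structured)

# Markup explicitly assigns values to components of the text,
# so a program can read them without guessing.
components = {child.tag: child.text for child in root}

# Removing the markup recovers the underlying unstructured text.
plain = "".join(root.itertext())

print(components)
print(plain == unstructured)
```

In this sketch the marked-up components (person, title, place, date) can be retrieved directly, whereas the unstructured sentence would first require a method such as named entity recognition to identify them.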
Computational analysis of text is motivated by the increasing accessibility of digitized text resources, coupled with a contemporary environment in which academic, corporate, and personal knowledge production takes the form of born-digital text. Attempts to analyze phenomena that occur in these spheres require access to digital text. Michigan State University's interest in digital text resources can be gauged by curricular offerings and by the presence of research agendas oriented toward digital text in Writing, Rhetoric and American Cultures, Linguistics, Political Science, Hindi Language, German Literature, Business, and Radiology. At the programmatic level, interest in digital text resources can be gauged by the creation of the College of Arts and Letters Undergraduate Degree in Experience Architecture, Undergraduate Specialization in Digital Humanities, and Graduate Certificate in Digital Humanities; the College of Business Master's Degree in Business Analytics; and the Department of Anthropology Cultural Heritage Informatics Fellows program. It can also be gauged by the establishment and ongoing activity of digital scholarship centers such as the College of Arts and Letters Creativity Exploratory, Matrix: The Center for Digital Humanities and Social Sciences, Writing in Digital Environments, the Sociolinguistics Lab, and the Digital Humanities and Literary Cognition Lab.
Digital text collection occurs in many subject areas, and the readiness of the collected text for computational analysis varies. Existing collection strengths lie in digital text held by Linguistics, English and American Literature, the local copy of the HathiTrust collection (more than 3 million volumes), and exemplary digital collections such as Feeding America: The Historic American Cookbook Project.