Introduction to Textual Analysis

This module introduces text and data mining techniques that enable different ways of close and distant reading across a collection, or corpus, of texts. Franco Moretti coined the term “distant reading” to describe the use of computational analysis to find and visualize patterns in the language of many texts, patterns that may be difficult to see when reading each text individually. With large collections of digital-first texts (e.g., electronic correspondence), it may be impossible for one person to read and review each piece closely. Digital textual analysis can be particularly helpful for studying the following: the meaning (semantics) of words and documents; how words change over time; the frequency of a term over time; a concordance to a corpus; named entity recognition; and text reuse.


  1. Increased understanding of textual analysis techniques and why they are beneficial for certain types of research.
  2. Ability to assemble a data set appropriate for textual analysis.
  3. Ability to engage in quick textual analysis techniques using free online tools (Bookworm and Voyant Tools).


Questions to Consider

  • Cohen’s data set consists only of the titles of nineteenth-century works. What are the advantages and disadvantages of using titles alone as a data set? How do you think having the full text might change the results?
  • Riddell’s analysis concerns trends over time in a subfield. Would you expect to see similar trends in your field, particularly those presented in figure 3.8?
  • Compare Riddell’s figures 3.8 and 3.9. Does changing the scale of the vertical axis change your reading of the frequencies?
  • How do the differences between Cohen’s and Riddell’s corpora (primary sources vs. secondary literature) inform their approaches and analyses? Do you think that one of these corpora or types of sources is better suited to topic modeling than the other?
  • Based on Brett’s discussion of topic modeling, at what stage of a project do you think it is best to conduct textual analysis?


Activity 1:

Try using Bookworm to find rhetorical trends in digitized texts from Open Library and Google Books with this tutorial: https://labs.ssrc.org/dds/articles/try-bookworm/. Bookworm charts word frequencies over time. By comparing words or terms across a corpus of texts, it is possible to trace changes in word use over time or to see when particular terms enter the English lexicon.
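For readers who want to see the idea behind a frequency-over-time chart, here is a minimal Python sketch. It is not how Bookworm itself is implemented. The toy corpus, the tokenizer, and the function names are all assumptions made up for illustration. The sketch divides a term's count by the total number of words for each year, so that years with more text do not automatically appear to use the term more.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration only.
corpus = [
    (1850, "The railway changed the town and the railway changed trade."),
    (1850, "A quiet town by the river."),
    (1900, "The telephone and the railway connected every town."),
    (1900, "The telephone rang in the office."),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency_by_year(docs, term):
    """Return {year: occurrences of term / total tokens for that year}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in docs:
        tokens = tokenize(text)
        total_counts[year] += len(tokens)
        term_counts[year] += tokens.count(term.lower())
    # Guard against years whose documents contain no tokens at all.
    return {
        year: (term_counts[year] / total if total else 0.0)
        for year, total in sorted(total_counts.items())
    }

for year, freq in relative_frequency_by_year(corpus, "railway").items():
    print(year, round(freq, 3))
```

Plotting these values against the year gives the kind of trend line the tutorial explores. With a real corpus, decisions such as how to tokenize and whether to merge word forms can noticeably change the curve.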

Activity 2:

Use Voyant Tools to analyze a corpus of texts, examining word frequency, the corpus grid, the corpus summary, and keyword-in-context analysis: https://labs.ssrc.org/dds/articles/voyant-tutorial/.
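Keyword-in-context (KWIC) analysis lists every occurrence of a term together with the words immediately around it. The Python sketch below illustrates the general technique only, not Voyant's own implementation. The sample sentence, the `kwic` function, and its `window` parameter are hypothetical names chosen for this example.

```python
import re

# Hypothetical sample text invented for illustration only.
sample = "The whale surfaced near the ship. Nobody on the ship had seen a whale so large."

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `window` tokens on either side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

for left, word, right in kwic(sample, "whale", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} | {word} | {right}")
```

Lining up the keyword in a single column makes it easier to scan for recurring phrases or contrasting uses of the same word.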

Project Lens

Take a look at either Ben Schmidt’s “State of the Union in Context” at http://benschmidt.org/poli/2015-SOTU or Lindsay King and Peter Leonard’s “Robots Reading Vogue” at http://dh.library.yale.edu/projects/vogue/.

  • Look for the data the authors are using. Would you consider this open data?
  • Is any of this analysis replicable? (“Robots Reading Vogue” uses several different methods. Are some more easily replicated than others?)
  • Are the results of the textual analyses presented in these projects legible to someone who is not already an expert? (This might help: Schmidt, Benjamin and Mitch Fraas, “The Language of the State of the Union,” The Atlantic. January 18, 2015. http://www.theatlantic.com/politics/archive/2015/01/the-language-of-the-state-of-the-union/384575/.)
Updated on August 1, 2018
