8 Digital Distant Reading: Machines reading text

Juan Pablo Alperin, Sophie Mackenzie, and Lakota Rich

Learning Objectives

  • How we can use data to understand reading behaviour and preference
  • How digital technology is applied for different forms of reading and in the digital humanities

Key Messages

  • That machine reading provides great possibilities, but has its limits
  • That digital reading can provide insights, and ways of representing those insights, that differ from those of other methods of reading and humanities work
  • That digital technology and the digital humanities encounter biases, and that being aware of these biases and identifying ways to address them is valuable
  • That our brains are learning and evolving along with advances in technology and digital reading

We now turn our attention to “distant reading”, a term coined by Franco Moretti that is now commonly used to refer to the use of digital tools and computational methods to process and analyze texts and metadata about texts. Because distant reading relies on digital tools, many texts can be analyzed quickly, at a scale that would be hard, if not impossible, for any individual to match.

Machine reading and the digital humanities

To read at a distance, we rely on machines to do the reading for us. Machine reading is the “automatic, unsupervised understanding of text” using software. Machine reading often involves “natural language processing”, which is the application of computational techniques, often machine learning, to make sense of human language. As these techniques have become available, scholars from the humanities have begun employing them and, in doing so, have created an interdisciplinary field, the digital humanities, that combines digital technologies with methods and questions from the humanities.

A selection of terminology and methods in the digital humanities

N-gram: N-grams, at a basic level, are sequences of N words. “Dog,” one word, would be a unigram, while “brown dog” would be a bigram, and so on. N-grams can be used to estimate the probability of a word appearing next in a sequence of words. N-grams are often used in Natural Language Processing, a branch of artificial intelligence concerned with enabling computers to process and generate human language. Google Books offers an Ngram Viewer tool, while Randy Olson and Ritchie King created a Reddit Ngram Viewer.
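As a rough illustration of the N-gram idea, the following Python sketch splits a sentence into bigrams and uses their counts to estimate how likely one word is to follow another. The function names, the whitespace tokenizer, and the sample sentence are all assumptions made for this example; real tools handle punctuation, capitalization, and much larger corpora with more care.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_probability(tokens, previous, candidate):
    """Estimate the probability that `candidate` follows `previous`, using bigram counts."""
    bigram_counts = Counter(ngrams(tokens, 2))
    # Count every bigram whose first word is `previous`.
    previous_count = sum(
        count for (first, _), count in bigram_counts.items() if first == previous
    )
    if previous_count == 0:
        return 0.0  # `previous` never appears with a following word
    return bigram_counts[(previous, candidate)] / previous_count


# Hypothetical sample sentence, invented for this sketch.
tokens = tokenize("the brown dog saw the brown cat and the black dog")
bigrams = ngrams(tokens, 2)
probability = next_word_probability(tokens, "brown", "dog")
print(bigrams[:3])
print(probability)
```

In this tiny sample, “brown” is followed once by “dog” and once by “cat”, so the estimate for “dog” after “brown” is one half. With a corpus this small the number means very little; the same counting only becomes informative over many texts.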

Optical character recognition: Optical character recognition (OCR) is the process of converting images of printed or handwritten text into machine-encoded text, making it editable and searchable.

Stylometry: Stylometry is the study of measurable elements of linguistic style. Common applications of stylometry include determining the author(s) of a text, also known as authorship attribution, and identifying macro patterns of style throughout an author’s corpus, within certain genres, among groups of authors, and more.
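One simple way to picture authorship attribution is to compare how often texts use very common “function words.” The Python sketch below is a minimal illustration under stated assumptions: the word list, the function names, and the sample texts are invented for the example, and summing absolute differences is only one of many possible distance measures. Stylometric studies typically use far longer word lists, larger samples, and more careful statistics.

```python
import re
from collections import Counter

# Assumed, illustrative list of function words; real studies use many more.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]


def style_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def profile_distance(profile_a, profile_b):
    """Sum the absolute differences between two profiles (smaller means more similar)."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


def closest_author(unknown_text, known_texts):
    """Return the author whose sample text has the profile nearest to the unknown text."""
    unknown = style_profile(unknown_text)
    return min(
        known_texts,
        key=lambda author: profile_distance(unknown, style_profile(known_texts[author])),
    )


# Hypothetical writing samples, invented for this sketch.
known_texts = {
    "Author A": "the dog and the cat sat in the sun",
    "Author B": "it is said that it rains a lot",
}
guess = closest_author("the bird and the fish swam in the sea", known_texts)
print(guess)
```

A result from a sketch like this is a measure of similarity, not proof of authorship: the choice of words to count and of texts to compare is made by the researcher.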

Topic modeling: Topic modeling is a means of identifying, measuring, and tracking topics, or “statistical word clusters,” throughout texts. For example, applying topic modeling to the Pennsylvania Gazette, Sharon Block identified ‘runaways’ as a prevalent topic in the publication; the ‘runaways’ topic is linked to a cluster of words: away, reward, servant, named, feet, jacket, high, paid, hair, etc.

Type-token ratio: A type-token ratio (TTR) is a measure of the lexical diversity of a text. The TTR is calculated by dividing the number of types, or unique words, by the number of tokens, or total number of words.
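The calculation can be sketched in a few lines of Python. The function name, the simple rule used to split text into words, and the sample sentence are assumptions made for this example, not part of any particular tool.

```python
import re


def type_token_ratio(text):
    """Divide the number of unique words (types) by the total number of words (tokens)."""
    # Treat runs of letters and apostrophes as words, ignoring case.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # an empty text has no words to compare
    types = set(tokens)
    return len(types) / len(tokens)


# Hypothetical sample sentence: 9 tokens, 6 of them unique.
sample = "the dog chased the cat and the cat ran"
ratio = type_token_ratio(sample)
print(round(ratio, 2))
```

Here the sentence has nine tokens but only six types, so the ratio is 6 divided by 9, or about 0.67. Because longer texts tend to repeat words more often, TTR values are best compared between texts of similar length.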

Critiques of distant and machine reading connect to broader discussions within the digital humanities and beyond about the ways in which science, math, technology, and related fields are often treated as objective, obscuring the subjectivities—including those of race, gender, sexuality, and more—at work. Scholars like James E. Dobson have critiqued the notion of machine reading as unsupervised and automatic. Dobson suggests that such terminology erases the influence of the human researcher engaged in machine reading work: “there cannot be an automated reading of a text that is free of the ‘taint’ of subjectivity”. These critiques connect to biases we encounter when working with data, as well as machine and algorithmic biases.

Machine Reading and the publishing industry

While acknowledging these critiques, we can also see how machine reading and digital humanities methods can be used in the publishing world. For example, insights obtained through machine reading can help identify traits that make books successful, possibly reducing the risk publishers take on when purchasing a manuscript. Information like language, story arc, theme, and characters can be extracted through distant reading and subsequently mapped to sales data to learn which characteristics lead to sales or, by mapping it onto readership data from programs like Jellybooks, to understand the characteristics that correlate with book completion rates.

The possibilities are not just for “Finding the Next Harry Potter.” By reading texts at a distance, patterns emerge, which can be used for other purposes, such as making book recommendations that are “speedier and more relevant.” Taken together, this method of reading can help inform the decisions of publishers and content producers, not by replacing them, but rather by supporting the work they do.

Example:

Popular digital humanities research tools

TAPoR 3 is a portal for accessing a wide range of tools for text analysis and digital humanities work more broadly.

Voyant Tools is an open-source text analysis application that can identify word frequency, types, tokens, average words per sentence, and more. It also includes tools for visualizing this data, such as Cirrus, a word cloud, and the Type Frequencies Chart, a line graph of the distribution of terms over the corpus.

For more tools, see the SFU Library’s Digital Humanities research guide: tools and methods page.

Exercises

Question: Imagine that you are a publisher and have access to any data that is out there. What information would be beneficial to you, and how could you get this information without infringing on personal privacy? How would you use this information to guide your decision making?

Readings

License


The Web, Publishing, and Ourselves by Juan Pablo Alperin, Sophie Mackenzie, and Lakota Rich is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
