Digital Distant Reading: Machines reading text

Juan Pablo Alperin; sophiemac; Lakota Rich

Learning Objectives

How we can use data to understand reading behaviour and preference
How digital technology is applied for different forms of reading and in the digital humanities

Key Messages

That machine reading provides great possibilities, but has its limits
That digital reading can provide different insights and ways of representing those insights than other methods of reading and humanities work
That digital technology and the digital humanities encounter biases, and that being aware of these biases and identifying ways to address them is valuable
That our brains are learning and evolving along with advances in technology and digital reading

We now turn our attention to “distant reading”, a term originally coined by Franco Moretti that is now commonly used to refer to the use of digital tools and computational methods to process and analyze texts and metadata about texts. Because distant reading uses digital tools, many texts can be analyzed quickly, at a scale that would be hard, if not impossible, for any individual to match.

Machine reading and the digital humanities

To read at a distance, we rely on machines to do the reading for us. Machine reading is the “automatic, unsupervised understanding of text” using software. Machine reading often involves “natural language processing”, which is the application of machine learning to make sense of human language. As these techniques have become available, scholars from the humanities have begun employing them and, in doing so, have created an interdisciplinary field, the digital humanities, that combines digital technologies with methods and questions from the humanities.

A selection of terminology and methods in the digital humanities

N-gram: N-grams, at a basic level, are sequences of N words. “Dog,” one word, would be a unigram, while “brown dog” would be a bigram, and so on. Counts of N-grams in a body of text can be used to estimate the probability of a word appearing next in a sequence of words. N-grams are often used in natural language processing, a branch of artificial intelligence concerned with enabling computers to interpret and produce human language. Google Books offers an Ngram Viewer tool, while Randy Olson and Ritchie King created a Reddit Ngram Viewer.
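To make the N-gram idea concrete, here is a minimal Python sketch. It splits a short made-up sentence into words, lists its bigrams, and uses bigram counts to estimate which word is likely to follow another. The sample sentence, the simple tokenizing rule, and the function names are all assumptions chosen for illustration; tools like the Ngram Viewer work at a far larger scale and with more careful text processing.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_probabilities(tokens, word):
    """Estimate the probability of each word that follows `word`, using bigram counts."""
    bigram_counts = Counter(ngrams(tokens, 2))
    following = {pair[1]: count for pair, count in bigram_counts.items() if pair[0] == word}
    total = sum(following.values())
    if total == 0:
        return {}
    return {nxt: count / total for nxt, count in following.items()}


# Hypothetical sample text, invented for this example.
sample = "the brown dog chased the brown cat and the dog barked"
tokens = tokenize(sample)

print(ngrams(tokens, 2)[:3])   # the first three bigrams
print(next_word_probabilities(tokens, "the"))  # "brown" follows "the" 2 times out of 3, "dog" 1 out of 3
```

In this tiny sample, “the” is followed by “brown” twice and by “dog” once, so the estimated probabilities are two thirds and one third. With a real corpus the same counting approach applies, only over many more words.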

Optical character recognition: Optical character recognition (OCR) is the process of converting images of printed or handwritten text into machine-encoded text, making it editable and searchable.

Stylometry: Stylometry is the study of measurable elements of linguistic style. Common applications of stylometry include determining the author(s) of a text, also known as authorship attribution, and identifying macro patterns of style throughout an author’s corpus, within certain genres, among groups of authors, and more.
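As a rough illustration of what a “measurable element of style” can look like, the Python sketch below compares how often two texts use a handful of very common function words. This is a simplified stand-in, not an established attribution method: the word list, the distance measure, and the sample sentences are all assumptions made up for this example, and real stylometric studies use much longer texts and more careful statistics.

```python
from collections import Counter
import re

# Hypothetical short list of function words; real studies typically use many more.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "but", "that"]


def style_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def profile_distance(profile_a, profile_b):
    """Mean absolute difference between two profiles (smaller means more similar)."""
    differences = [abs(a - b) for a, b in zip(profile_a, profile_b)]
    return sum(differences) / len(differences)


# Made-up sample sentences, for illustration only.
text_a = "The ship sailed to the edge of the map and the crew slept."
text_b = "The crew of the ship slept and the map lay in the cabin."
text_c = "But a storm is a storm, and a captain knows that a storm ends."

print(profile_distance(style_profile(text_a), style_profile(text_b)))
print(profile_distance(style_profile(text_a), style_profile(text_c)))
```

A smaller distance suggests that two texts use these words at similar rates. On sentences this short the numbers mean very little; the point is only to show how style can be turned into something countable.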

Topic modeling: Topic modeling is a means of identifying, measuring, and tracking topics, or “statistical word clusters,” throughout texts. For example, applying topic modeling to the Pennsylvania Gazette, Sharon Block identified ‘runaways’ as a prevalent topic in the publication; the ‘runaways’ topic is linked to a cluster of words: away, reward, servant, named, feet, jacket, high, paid, hair, etc.

Type-token ratio: A type-token ratio (TTR) is a measure of the lexical diversity of a text. The TTR is calculated by dividing the number of types, or unique words, by the number of tokens, or total words.
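The type-token ratio calculation can be sketched in a few lines of Python. The sample sentence and the simple rule for splitting text into words are assumptions made for illustration; different tokenizing choices (for example, how to treat punctuation or capital letters) will give somewhat different ratios.

```python
import re


def type_token_ratio(text):
    """Divide the number of unique words (types) by the total number of words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # an empty text has no words to compare
    types = set(tokens)
    return len(types) / len(tokens)


# Made-up sample sentence: 11 tokens, 8 of them unique.
sample = "the cat sat on the mat and the dog sat too"
print(round(type_token_ratio(sample), 3))  # 8 types / 11 tokens, about 0.727
```

Here “the” appears three times and “sat” twice, so the sentence has 11 tokens but only 8 types. Because longer texts tend to repeat words more, TTR values are best compared between texts of similar length.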

Critiques of distant and machine reading connect to broader discussions within the digital humanities and beyond about the ways in which science, math, technology, and related fields are often treated as objective, obscuring the subjectivities—including those of race, gender, sexuality, and more—at work. Scholars like James E. Dobson have critiqued the notion of machine reading as unsupervised and automatic. Dobson suggests that such terminology erases the influence of the human researcher engaged in machine reading work: “there cannot be an automated reading of a text that is free of the ‘taint’ of subjectivity”. These critiques connect to biases we encounter when working with data, as well as machine and algorithmic biases.

Machine reading and the publishing industry

While acknowledging these critiques, we can also see how machine reading and digital humanities methods can be used in the publishing world. For example, insights obtained through machine reading can help identify traits that make books successful, thereby possibly reducing the risk publishers take on when purchasing a manuscript. Distant reading can be used to extract information such as language, story arc, theme, and characters. This information can then be mapped to sales data to learn which characteristics lead to sales or, by mapping it onto readership data from programs like Jellybooks, used to understand the characteristics that correlate with book completion rates.

The possibilities are not just for “Finding the Next Harry Potter.” By reading texts at a distance, patterns emerge that can be used for other purposes, such as making book recommendations that are “speedier and more relevant.” Taken together, this method of reading can help inform the decisions of publishers and content producers, not by replacing them, but rather by supporting the work they do.

Example:

https://www.npr.org/player/embed/127211884/127303878

Popular digital humanities research tools

TAPoR 3 is a portal for accessing a wide range of tools for text analysis and digital humanities work more broadly.

Voyant Tools is an open-source text analysis application that can identify word frequency, types, tokens, average words per sentence, and more. It also includes tools for visualizing this data, such as Cirrus, a word cloud, and the Type Frequencies Chart, a line graph of the distribution of terms over the corpus.

For more tools, see the SFU Library’s Digital Humanities research guide: tools and methods page.

Exercises

Question: Imagine that you are a publisher and have access to any data that is out there. What information would be beneficial to you, and how could you get this information without infringing on personal privacy? How would you use this information to guide your decision making?

Readings

Phillips, Stephen. 2016. Can Big Data Find the Next ‘Harry Potter’? The Atlantic.
Martineau, Kim. 2019. Finding a Good Read Among Billions of Choices. MIT News.
Karlis, Nicole. 2019. What reading 3.5 million books tells us about gender stereotypes. Salon.
Marche, Stephen. 2012. Literature Is not Data: Against Digital Humanities. LARB.
- Selisker, Scott and Syme, Holger S. 2012. In Defense of Data: Responses to Stephen Marche’s ‘Literature Is not Data’. LARB.
Neary, Lynn. 2016. Publishers’ Dilemma: Judge A Book By Its Data Or Trust The Editor’s Gut? NPR.
Fischetti, Mark. 2017. Great Literature Is Surprisingly Arithmetic. Scientific American.
Emerging Technology from the arXiv. 2016. Data Mining Reveals the Six Basic Emotional Arcs of Storytelling. MIT Technology Review.
Posner, Miriam. 2016. What’s Next: The Radical, Unrealized Potential of Digital Humanities. U Minnesota Press.
Willens, Max. 2018. Viral publishers see sharp engagement drops on Facebook. Digiday.
Zhang, Sarah. 2015. The Pitfalls of Using Google Ngram to Study Language. Wired.
Ouellette, Jennifer. 2019. Tolkien was right: Scholars conclude Beowulf likely the work of single author. Ars Technica.
Alter, Alexandra and Russell, Karl. 2016. Moneyball for Book Publishers: A Detailed Look at How We Read. New York Times.

License

The Web, Publishing, and Ourselves Copyright © 2020 by Juan Pablo Alperin; sophiemac; and Lakota Rich is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.