Interactive TOpic Model and MEtadata Visualization


About


About the Interface

TOME is a tool for humanities scholars, designed as an entrypoint into large collections of digitized text. It rests upon topic modeling, a machine learning technique for automatically identifying a set of topics, or themes, in a document set. Megan R. Brett provides a basic introduction to topic modeling here. Ted Underwood provides a slightly more technical overview here. And Scott Weingart provides some additional history of the technique, along with an account of some additional applications, here.

For our part, we sought to leverage the affordances of topic modeling for a particular use: refining a large collection of documents into a smaller set of more relevant texts, ideally texts that a scholar would eventually read. In designing our interface, we drew from our knowledge of how humanities scholars usually begin the process of online research: with a keyword search. We also incorporated ideas about the value of exploratory data analysis, an approach to analyzing data that emphasizes iteration. As scholars gain a better sense of what is in their dataset, and what questions they would like to pursue, they can return to the original model and refine it, adjusting any available parameters and exploring the results. In this interface, scholars cannot refine the original topic model itself, but they can return to the filtering mechanism at the top of the page, selecting additional relevant topics (or de-selecting less relevant ones) as they continue to refine the set of documents they plan to read.

TOME is a prototype interface, with all the caveats that a prototype entails. In making our interface public, our goal is to offer an interactive example of how topic modeling, and machine learning techniques more generally, might be incorporated into the humanities research process, and to expand the conversation about their uses and their constraints.

About the Project

TOME began as a collaboration between Lauren Klein and Jacob Eisenstein, funded by an NEH Office of Digital Humanities Startup Grant. During the first phase of the project, from 2013 to 2015, Iris Sun (MS DM ‘14), Ana Smith (BS CS ‘15), and Catherine Roshelli (BS CM ‘15) worked on the project. The resultant paper, “Exploratory Analysis for Digitized Archival Collections,” Digital Scholarship in the Humanities 30.1 (Fall 2015), describes that work.

In 2017, Klein returned to the project with a new project team: Adam Hayward (BS CS ‘20), Nikita Bawa (BS CS ‘18), Caroline Foster (MS HCI ‘17), and Morgan Orangi (MS DM ‘18). This interface is the result of that collaboration. It is described in more technical detail in “TOME: A Topic Modeling Tool for Document Discovery and Exploration,” Digital Humanities 2018 (Association of Digital Humanities Organizations, 2018).

Technical Details

Our corpus consists of over 300,000 documents drawn from a collection of nineteenth-century newspapers that focus on issues including abolition and women’s rights. These papers include: The Christian Recorder, The Colored American, Douglass Monthly, Frank Leslie’s Weekly, Frederick Douglass Paper, Freedom’s Journal, Godey’s Lady’s Book, The Liberator, The Lily, The National Anti-Slavery Standard, The National Citizen and Ballot Box, The National Era, The North Star, The Provincial Freeman, The Revolution, and The Weekly Advocate.

The documents were scraped from Accessible Archives under an agreement with Accessible. Additional data cleaning, as well as metadata creation, was performed with custom Python scripts.
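As a rough illustration, a cleaning step of this kind might look like the following. The function names, fields, and patterns here are hypothetical, not the project’s actual scripts.

```python
# A hypothetical sketch of the kind of cleaning and metadata extraction
# performed by the custom scripts; all names and patterns are illustrative.
import re

def clean_article(raw_text):
    """Normalize whitespace and strip markup left over from scraping."""
    text = re.sub(r'<[^>]+>', ' ', raw_text)  # drop residual HTML tags
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

def extract_metadata(record):
    """Pull basic bibliographic fields from a scraped record (a dict)."""
    return {
        'newspaper': record.get('source', '').strip(),
        'date': record.get('date', '').strip(),
        'title': clean_article(record.get('title', '')),
    }
```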

The topic model of our corpus was created with gensim, the vector space and topic modeling library. We employed gensim’s wrapper for MALLET’s implementation of Latent Dirichlet Allocation (LDA). We generated 100 topics over 100 iterations, after filtering out the 100 most common words in the corpus. We then printed the topics and the topical composition of each document to CSV files, and ingested the data into a MySQL database using Django’s ORM framework.
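A minimal sketch of this modeling step appears below. It assumes a gensim version prior to 4.0 (which still ships the MALLET wrapper), a local MALLET installation, and a list of pre-tokenized documents; the path and variable names are illustrative, not the project’s own.

```python
# A minimal sketch of the modeling step described above, assuming gensim < 4.0
# and a local MALLET install; paths and variable names are illustrative.
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet

MALLET_PATH = '/path/to/mallet/bin/mallet'  # assumption: local MALLET binary

# docs is assumed to be a list of tokenized documents: [['word', ...], ...]
dictionary = Dictionary(docs)
dictionary.filter_n_most_frequent(100)  # filter out the 100 most common words

corpus = [dictionary.doc2bow(doc) for doc in docs]

model = LdaMallet(
    MALLET_PATH,
    corpus=corpus,
    id2word=dictionary,
    num_topics=100,   # 100 topics
    iterations=100,   # 100 iterations
)

# The topic-word lists and per-document topic mixtures can then be
# written to CSV files for ingestion into the database.
topics = model.show_topics(num_topics=100, formatted=False)
doc_topics = list(model[corpus])  # (topic_id, proportion) pairs per document
```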

The TOME interface was built with Django, which we chose for its ORM capabilities. We also employ Ajax to pull additional data from the server in response to user interactions. Because of the scale of some of the underlying calculations, the site sometimes experiences slow load times. Should we develop TOME from a prototype into a fully functioning interface, additional optimizations, or perhaps raw SQL queries, will be required.
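For readers curious how the ORM and Ajax pieces fit together, here is a hypothetical sketch of a model and endpoint in this spirit; the actual schema and views live in the GitHub repository linked below.

```python
# A hypothetical Django model and Ajax endpoint in the spirit of the
# interface described above; all names are illustrative, not the actual schema.
from django.db import models
from django.http import JsonResponse

class Document(models.Model):
    title = models.CharField(max_length=255)
    newspaper = models.CharField(max_length=100)
    date = models.DateField()

class DocumentTopic(models.Model):
    document = models.ForeignKey(Document, on_delete=models.CASCADE)
    topic_number = models.IntegerField()
    proportion = models.FloatField()  # share of the document given to the topic

def topic_documents(request, topic_number):
    """Return the 50 documents most strongly associated with a topic."""
    rows = (
        DocumentTopic.objects
        .filter(topic_number=topic_number)
        .select_related('document')           # avoid one query per document
        .order_by('-proportion')[:50]
    )
    return JsonResponse({'documents': [
        {'title': r.document.title,
         'date': r.document.date.isoformat(),
         'proportion': r.proportion}
        for r in rows
    ]})
```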

All of the code for this site, including the topic model and related processing scripts, can be found on GitHub.


Nineteenth-Century Newspapers

A topic model of 372,276 articles from 16 newspapers
March 16, 1827 - June 17, 1922


Topics

Topics ranked by percentage of corpus