Interactive TOpic Model and MEtadata Visualization


About


About the Interface

TOME is a tool for humanities scholars, designed as an entry point into large collections of digitized text. It rests upon topic modeling, a machine learning technique for automatically identifying a set of topics, or themes, in a document set. For background, Megan R. Brett has written a basic introduction to topic modeling, Ted Underwood a slightly more technical overview, and Scott Weingart a history of the technique along with an account of some additional applications.

For our part, we sought to leverage the affordances of topic modeling for a particular use: refining a large collection of documents into a smaller set of more relevant texts, ideally texts that a scholar would eventually read. In designing our interface, we drew on our knowledge of how humanities scholars usually begin the process of online research: with a keyword search. We also incorporated ideas about the value of exploratory data analysis, an approach to analyzing data that emphasizes iteration. In that approach, as the scholar gains a better sense of what is in their dataset and what questions they’d like to pursue, they can return to the original model and refine it, adjusting any available parameters and exploring the results. In TOME, scholars cannot refine the original topic model. But they can return to the filtering mechanism at the top of the page, selecting additional relevant topics (or de-selecting less relevant ones) as they continue to refine the set of documents they plan to read.
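As a minimal sketch of this kind of topic-based refinement, the example below ranks documents by the combined weight of the topics a scholar has selected. The function name, the data layout, and the scoring rule (a simple sum of weights) are assumptions made for illustration, not a description of how TOME itself scores documents. The topic numbers are borrowed from the ranked list below.

```python
def rank_documents(composition, selected_topics, top_n=10):
    """Return up to top_n (doc_id, score) pairs, highest combined weight first."""
    scores = {
        doc_id: sum(weights.get(topic, 0.0) for topic in selected_topics)
        for doc_id, weights in composition.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Documents with no weight on any selected topic are left out entirely.
    return [(doc_id, score) for doc_id, score in ranked if score > 0][:top_n]


# Hypothetical topic weights per document: {doc_id: {topic_number: weight}}.
composition = {
    "doc_a": {15: 0.5, 33: 0.25},
    "doc_b": {90: 0.75},
    "doc_c": {33: 0.125, 90: 0.125},
}

# First pass: select two slavery-related topics (15 and 33).
first_pass = rank_documents(composition, selected_topics={15, 33})

# Second pass: also select the suffrage topic (90), which widens the set.
second_pass = rank_documents(composition, selected_topics={15, 33, 90})
```

Each pass re-runs the same filter with a different selection, which mirrors the iterative select and de-select loop described above without touching the underlying model.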

TOME is a prototype interface, with the caveats that any prototype entails. In making our interface public, we aim to offer an interactive example of how topic modeling, and machine learning techniques more generally, might be incorporated into the humanities research process, and to expand the conversation about their uses and their constraints.

About the Project

TOME began as a collaboration between Lauren Klein and Jacob Eisenstein, funded by an NEH Office of Digital Humanities Startup Grant. During the first phase of the project, from 2013 to 2015, Iris Sun (MS DM ’14), Ana Smith (BS CS ’15), and Catherine Roshelli (BS CM ’15) worked on the project. The resultant paper, “Exploratory Analysis for Digitized Archival Collections,” Digital Scholarship in the Humanities 30.1 (Fall 2015), describes that work.

In 2017, Klein returned to the project with a new project team: Adam Hayward (BS CS ’20), Nikita Bawa (BS CS ’18), Caroline Foster (MS HCI ’17), and Morgan Orangi (MS DM ’18). This interface is the result of that collaboration. It is described in more technical detail in “TOME: A Topic Modeling Tool for Document Discovery and Exploration,” Digital Humanities 2018 (Alliance of Digital Humanities Organizations, 2018).

Technical Details

Our corpus consists of over 300,000 documents drawn from a collection of nineteenth-century newspapers, focusing on issues including abolition and women’s rights. These papers include: The Christian Recorder, The Colored American, Douglass Monthly, Frank Leslie’s Weekly, Frederick Douglass Paper, Freedom’s Journal, Godey’s Lady’s Book, The Liberator, The Lily, The National Anti-Slavery Standard, The National Citizen and Ballot Box, The National Era, The North Star, The Provincial Freeman, The Revolution, and The Weekly Advocate.

The documents were scraped from Accessible Archives under an agreement with Accessible. Additional data cleaning and metadata creation were performed with custom Python scripts.

The topic model of our corpus was created using gensim, a Python library for vector space and topic modeling. We employed gensim’s wrapper for MALLET’s implementation of Latent Dirichlet Allocation (LDA). We trained a model of 100 topics over 100 iterations, after filtering out the 100 most common words. We wrote the topics and the topical composition of each document to CSV files, then ingested the data into a MySQL database using Django’s ORM.
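Two of the steps above, removing the most common words and writing each document's topical composition to CSV, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's actual script: it reads "most common" as highest total count across the corpus, the function names and CSV column names are hypothetical, and the topic weights are made-up stand-ins for the model's output. It does not call gensim or MALLET.

```python
import csv
import io
from collections import Counter


def drop_most_common(docs, n):
    """Remove the n most frequent words (by total corpus count) from every document."""
    counts = Counter(word for doc in docs for word in doc)
    stoplist = {word for word, _ in counts.most_common(n)}
    return [[word for word in doc if word not in stoplist] for doc in docs]


def write_composition(rows, num_topics, out):
    """Write one CSV row per document: its id, then one weight per topic."""
    writer = csv.writer(out)
    writer.writerow(["doc_id"] + [f"topic_{k}" for k in range(num_topics)])
    for doc_id, weights in rows:
        writer.writerow([doc_id] + [f"{w:.4f}" for w in weights])


# Toy tokenized corpus (hypothetical data, not drawn from the newspapers).
docs = [
    ["the", "slave", "the", "freedom"],
    ["the", "church", "of", "the", "lord"],
    ["of", "the", "party", "vote"],
]
filtered = drop_most_common(docs, n=2)  # removes "the" and "of"

# Hypothetical document-topic weights, standing in for the model's output.
buffer = io.StringIO()
write_composition([(1, [0.9, 0.1]), (2, [0.25, 0.75])], num_topics=2, out=buffer)
csv_text = buffer.getvalue()
```

With the settings described above, the same calls would use `n=100` and `num_topics=100`, and `out` would be an open file rather than an in-memory buffer.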

The TOME interface was built with Django, which we chose because of its ORM capabilities. We also employ Ajax to pull additional data from the server in response to user interactions. Because of the sheer number of underlying calculations, the site sometimes experiences slow load times. Should we develop this prototype into a fully functioning interface, additional optimizations, or perhaps raw SQL, will be required.
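One form such an optimization might take is letting the database aggregate in a single query, rather than fetching rows and counting them in application code. The sketch below illustrates the idea with Python's built-in sqlite3 module standing in for MySQL. The table names, column names, and the query itself are hypothetical and do not reflect the project's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database; the table and
# column names below are hypothetical, not the project's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, newspaper TEXT);
    CREATE TABLE document_topic (document_id INTEGER, topic INTEGER, weight REAL);
    """
)
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    [(1, "The Liberator"), (2, "The Liberator"), (3, "The Lily")],
)
conn.executemany(
    "INSERT INTO document_topic VALUES (?, ?, ?)",
    [(1, 15, 0.5), (2, 15, 0.25), (3, 15, 0.0), (3, 90, 0.75)],
)


def documents_per_newspaper(conn, topic, min_weight):
    """Count, per newspaper, the documents whose weight for `topic` is at least min_weight."""
    query = """
        SELECT d.newspaper, COUNT(*)
        FROM document AS d
        JOIN document_topic AS dt ON dt.document_id = d.id
        WHERE dt.topic = ? AND dt.weight >= ?
        GROUP BY d.newspaper
        ORDER BY d.newspaper
    """
    return conn.execute(query, (topic, min_weight)).fetchall()


counts = documents_per_newspaper(conn, topic=15, min_weight=0.2)
```

The join, filter, and count all happen inside the database, so only one small result row per newspaper crosses into Python.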

All of the code for this site, including the topic model and related processing scripts, can be found on GitHub.


Nineteenth-Century Newspapers

A topic model of 372,276 articles from 16 newspapers
March 16, 1827 - June 17, 1922


Topics

  1. thy, thou, thee, er, love
  2. church, rev, pastor, sunday, work
  3. rev, conference, bishop, district, presiding
  4. party, democratic, state, vote, election
  5. church, conference, general, bishops, bishop
  6. god, lord, christ, jesus, spirit
  7. god, freedom, land, liberty, blood
  8. subject, question, opinion, public, regard
  9. flowers, sweet, summer, beautiful, bright
  10. heart, love, mother, child, life
  11. john, james, william, smith, thomas
  12. present, means, plan, purpose, success
  13. states, constitution, state, congress, united
  14. army, general, war, gen, soldiers
  15. antislavery, slavery, abolitionists, society, liberty
  16. life, world, soul, nature, mind
  17. society, meeting, friends, antislavery, held
  18. boston, massachusetts, phillips, antislavery, garrison
  19. silk, black, dress, white, fig
  20. moral, social, human, nature, mind
  21. cents, price, frank, copies, illustrated
  22. thing, things, give, put, thought
  23. fire, killed, murder, death, shot
  24. public, conduct, character, false, spirit
  25. st, evening, clock, sunday, saturday
  26. eyes, face, hand, stood, night
  27. book, author, work, volume, story
  28. money, dollars, pay, paid, hundred
  29. friends, hope, feel, friend, work
  30. fact, true, question, facts, point
  31. church, christian, religious, religion, churches
  32. paper, editor, press, article, papers
  33. slave, slavery, slaves, free, freedom
  34. house, morning, found, left, told
  35. resolved, convention, committee, meeting, resolutions
  36. meeting, evening, hall, audience, speech
  37. court, law, case, judge, state
  38. slavery, slave, freedom, regard, question
  39. cure, dr, remedy, diseases, pills
  40. mind, feelings, moment, felt, found
  41. war, country, nation, government, peace
  42. don, ll, thought, room, boy
  43. character, high, honor, success, position
  44. south, union, north, southern, slavery
  45. law, laws, human, rights, god
  46. number, year, states, population, thousand
  47. state, ohio, virginia, county, kentucky
  48. colored, white, negro, race, black
  49. chain, row, work, round, plain
  50. water, found, surface, iron, small
  51. lady, young, ladies, gentleman, dinner
  52. house, bill, senate, committee, senator
  53. school, children, schools, education, young
  54. work, labor, poor, home, give
  55. face, miss, eyes, love, mrs
  56. city, feet, house, building, street
  57. president, secretary, washington, office, general
  58. business, company, york, bank, cent
  59. water, river, feet, mountain, miles
  60. york, leslie, weekly, office, notice
  61. horse, head, dog, horses, feet
  62. york, american, world, game, club
  63. york, broadway, street, st, machine
  64. letter, sir, dear, friend, letters
  65. french, france, german, russia, germany
  66. ship, captain, board, vessel, boat
  67. address, cents, free, send, agents
  68. street, philadelphia, goods, store, hand
  69. table, page, prize, photographs, photograph
  70. death, life, died, dead, age
  71. year, hundred, days, months, week
  72. put, water, half, sugar, butter
  73. states, united, texas, mexico, government
  74. health, food, disease, body, patient
  75. wife, children, father, mother, young
  76. dr, college, university, professor, students
  77. music, theatre, musical, stage, play
  78. york, city, railroad, train, canada
  79. cotton, corn, market, wheat, farmer
  80. read, words, language, books, english
  81. brown, thatthe, bad, ofthem, intothe
  82. art, picture, beautiful, artist, pictures
  83. england, america, american, london, british
  84. land, california, miles, indians, indian
  85. africa, west, india, island, native
  86. lady, book, number, ladies, christmas
  87. ancient, rome, king, roman, egypt
  88. car, motor, machine, work, miles
  89. mrs, miss, mary, sarah, lucy
  90. women, woman, suffrage, rights, mrs
  91. boston, tribune, journal, hair, york
  92. war, states, government, lincoln, president
  93. temperance, drink, city, wine, liquor
  94. ohio, mass, newyork, philadelphia, standard
  95. lady, replied, sir, young, exclaimed
  96. de, king, paris, prince, french
  97. hand, harry, enter, hands, martin
  98. ll, don, em, ve, de
  99. ye, give, long, en, true
  100. face, eyes, back, love, don

Topics ranked by percentage of corpus
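
A ranking of this kind can be derived from the per-document topical composition. The sketch below shows one plausible way to do so, assuming every document counts equally regardless of length; that weighting, the function name, and the toy numbers are assumptions for illustration and may differ from how the percentages behind this list were computed.

```python
def topic_prevalence(composition, num_topics):
    """Return (topic, percent) pairs sorted from most to least prevalent."""
    totals = [0.0] * num_topics
    for weights in composition:
        for topic, weight in enumerate(weights):
            totals[topic] += weight
    grand_total = sum(totals)
    shares = [(topic, 100.0 * total / grand_total) for topic, total in enumerate(totals)]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Hypothetical composition: one row per document, one weight per topic.
composition = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.5, 0.25, 0.25],
]
ranking = topic_prevalence(composition, num_topics=3)
```

Weighting each document by its word count instead would give longer articles more influence on the ranking.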