Introduction to Text Mining

Word Frequencies and Relative Word Frequencies Over Time for Newspapers as Data

Objective

By the end of this lesson you will be able to:

  • Describe the basic concept of text mining,

  • Use a simple exploratory tool to conduct text mining on a chosen collection,

  • Examine the outputs of two common text mining approaches, and

  • Assess two common text mining methods for use.

Text Mining 101

A brief definition:

Text mining is the process of using computers to analyze and discover knowledge within a text or collection of texts.

We rely on the results of text mining every time we conduct a search online. Search results are the outputs of text mining processes that are run against web pages to determine how relevant a given page is to your query. In the case of search, we rarely see what's going on behind the scenes, but understanding the basics of how text mining works can help us understand how such algorithms shape our everyday lives.

Knowing a bit about how we can use computation to conduct analysis can also help us ask different kinds of research questions. Instead of conducting a search and returning a list of potentially relevant results, perhaps you want to investigate a large amount of text by asking comparative questions or tracking patterns over time. For example:

  • What were the most common terms or phrases in Spanish-speaking border newspapers between 1915 and 1920?

  • How does the use of certain terms and phrases in borderlands newspapers shift before and after each world war?

  • What is the relative frequency of coverage of Japanese Internment in newspapers from different communities across the borderlands?

As researchers, we conduct text mining by extracting information from a selection of texts so that we can gather different kinds of evidence in pursuit of our research interests.

Text Mining Demystified

Text mining is really just counting! In its simplest form, we use computers to count how frequently words appear in a given text (Nelson & Roland, 2017). Depending on what you are trying to accomplish, the counting and sorting can become more complex, but we'll mostly be using text mining to count word frequencies across time to consider historical trends and patterns.

Let's look at an example. In the image below we see a table where each row indicates a single word and the number of times it was counted in the text. The highlighted word minería (Spanish for mining) appears 11 times in the collected volumes of El Tucsonense between 1915 and 1929. It is as common as the surrounding words, including mezcla (Spanish for mixture) and neutralidad (Spanish for neutrality).

The blue line to the right demonstrates how frequently the word appears across volumes. Trends across time will be discussed in a later section of this introduction.

Screen capture of a word frequency table produced using digitized volumes of El Tucsonense and voyant-tools.org. An interactive version of this word table is available here.

Text Mining Requires Interpretive Decision Making

That's where you come in! For your results to be useful and meaningful in relation to your research question, you'll want to think carefully about the choices you make along the way.

In the Border Hub, you can choose among a selection of newspapers, you may choose to count and compare several terms, and you are likely to analyze patterns in their usage over time. You'll want to think carefully about the historical, linguistic, and cultural contexts of the newspapers you choose. You'll also want to think carefully about the terms you select and whether they carry multiple meanings or are historically appropriate to the period you are analyzing.

Case Method 1: Word Frequencies

You've already seen an example of a word frequency table above, but what, exactly, is happening computationally to produce that table? Using a scripting language like Python or R, we write a program that reads in one or more text files. Or to put that another way, we use a computer program to access many newspapers at the same time. Next, the program splits up the text of the newspapers into a list of individual words, sometimes called unigrams. The program creates an empty table, and proceeds to move through the list of words. For each word in the list, the program checks the table to see if it has already been added. If the word is not in the table, it adds the word and counts it as the first instance. If the word is in the table, it increases the instance count by one. This continues until the program has processed every word in the list. Below is a step-by-step overview of the process:

Explanation of how a programming script processes text to generate a word frequency table.
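The counting steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the table shown earlier: the function name `count_words` and the sample sentence are invented for the example, and the way the text is split into words (lowercasing, then keeping runs of letters and digits) is one simple assumption among many possible choices.

```python
import re

def count_words(text):
    """Return a dict mapping each word in `text` to the number of times it appears."""
    # Lowercase the text and split it into unigrams. The pattern \w+ matches
    # runs of letters and digits, including accented letters.
    words = re.findall(r"\w+", text.lower())

    table = {}  # the empty frequency table
    for word in words:
        if word not in table:
            table[word] = 1   # not in the table yet: count the first instance
        else:
            table[word] += 1  # already in the table: increase the count by one
    return table

# Invented sample sentence, not a quotation from any newspaper.
sample = "La mezcla crece. La mezcla de neutralidad."
frequencies = count_words(sample)

# Print the table with the most frequent words first.
for word, count in sorted(frequencies.items(), key=lambda item: item[1], reverse=True):
    print(word, count)
```

Python's standard library also offers `collections.Counter`, which performs the same tally in a single call; the explicit loop is used here only because it mirrors the step-by-step description.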

STOP

...BUT some words are so common in a given language (or in a given type of text) that they aren't all that helpful for analysis. We can make a list of these words (called stop words) and stop the script from including these terms in our frequency tables or other outputs. Let's look at an example from the Phoenix Tribune on October 25, 1919:

Image from the Phoenix Tribune newspaper

By removing the most common words, we are left with the following: coming, fall, changeable, somewhat, disagreeable, weather, minds, revert, year, ago, world, grip, dreaded, pandemic, influenza, statistics, show, people, died, disease, United States, Americans, killed, shot, shrapnel, Hun.

Passage of text with stop words struck through.

The remaining terms primarily relate to weather, time, illness, war, and nationality. These broader themes could prove quite useful in understanding how the 1918 influenza pandemic was being characterized in newspapers as readers braced for the coming flu season.
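As a rough sketch of how a script might drop stop words, the Python below filters a list of words against a small stop word list. The list is deliberately tiny, the function name `remove_stop_words` is hypothetical, and the sample sentence is invented for illustration; it is not the text of the Phoenix Tribune passage.

```python
import re

# A tiny, illustrative stop word list; real projects use a much longer one.
STOP_WORDS = {"the", "in", "is", "and", "a", "of", "to", "with"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split `text` into lowercase words and drop any that are stop words."""
    # Keeping apostrophes inside words lets contractions such as "can't"
    # match the corresponding entries in a stop word list.
    words = re.findall(r"[\w']+", text.lower())
    return [word for word in words if word not in stop_words]

# Invented sample sentence (not a quotation from the newspaper).
sample = "The weather in the fall is somewhat changeable and disagreeable"
remaining = remove_stop_words(sample)
print(remaining)
```

Storing the stop words in a set rather than a list keeps each membership check fast, which starts to matter once the text runs to millions of words.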

Common Stop Words in English Include...

 ["a", "about", "above", "across", "after", "again", "against", "all", "almost", "along", "already", "also", "although", "always", "am", "among", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "because", "been", "before", "being", "beside", "between", "both", "bottom", "but", "by", "call", "can", "cannot", "can't", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "don't", "done", "down", "during", "each", "either", "else", "enough", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "for", "from", "front", "further", "get", "give", "go", "had", "has", "hasn't", "have", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "however", "i", "if", "in", "indeed", "into", "is", "it", "its", "itself", "last", "least", "made", "many", "may", "me", "might", "mine", "more", "most", "mostly", "move", "much", "must", "my", "myself", "neither", "never", "next", "no", "nobody", "none", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "put", "rather", "same", "see", "seem", "she", "should", "since", "so", "some", "somehow", "someone", "something", "somewhere", "still", "such", "take", "than", "that", "the", "their", "them", "themselves", "then", "there", "these", "they", "thing", "this", "those", "though", "through", "throughout", "to", "together", "too", "toward", "under", "until", "up", "upon", "us", "very", "was", "we", "well", "were", "what", "whatever", "when", "whenever", "where", "wherever", "whether", "which", "while", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"]

Common Stop Words in Spanish Include...

["al", "alguna", "algunas", "alguno", "algunos", "algún", "ambos", "ante", "antes", "aquel", "aquellas", "aquellos", "aqui", "arriba", "atras", "bajo", "bastante", "bien", "cada", "cierta", "ciertas", "cierto", "ciertos", "como", "con", "conseguimos", "conseguir", "consigo", "consigue", "consiguen", "consigues", "cual", "cuando", "de", "del", "dentro", "desde", "donde", "dos", "el", "ellas", "ellos", "empleais", "empleamos", "emplean", "emplear", "empleas", "empleo", "en", "encima", "entonces", "entre", "era", "eramos", "eran", "eras", "eres", "es", "esta", "estaba", "estado", "estais", "estamos", "estan", "estoy", "fin", "fue", "fueron", "fui", "fuimos", "gueno", "ha", "hace", "haceis", "hacemos", "hacen", "hacer", "haces", "hago", "incluso", "intenta", "intentais", "intentamos", "intentan", "intentar", "intentas", "intento", "ir", "la", "largo", "las", "lo", "los", "mientras", "mio", "modo", "muchos", "muy", "nos", "nosotros", "o", "otro", "para", "pero", "podeis", "podemos", "poder", "podria", "podriais", "podriamos", "podrian", "podrias", "por", "por qué", "porque", "primero", "puede", "pueden", "puedo", "que", "quien", "sabe", "sabeis", "sabemos", "saben", "saber", "sabes", "se", "ser", "si", "siendo", "sin", "sobre", "sois", "solamente", "solo", "somos", "soy", "su", "sus", "también", "teneis", "tenemos", "tener", "tengo", "tiempo", "tiene", "tienen", "todo", "trabaja", "trabajais", "trabajamos", "trabajan", "trabajar", "trabajas", "trabajo", "tras", "tuyo", "ultimo", "un", "una", "unas", "uno", "unos", "usa", "usais", "usamos", "usan", "usar", "usas", "uso", "va", "vais", "valor", "vamos", "van", "vaya", "verdad", "verdadera", "verdadero", "vosotras", "vosotros", "voy", "y", "yo"]

Lists of common stop words for many other languages are easy to find with a simple Google search. Many projects start with a common list like the one above and add more terms that are specific to the texts that they are analyzing. For example, the word "tribune" probably shows up a lot in an analysis of the paper from the example above because of its title. For another illustration, go back to the interactive word frequency table we created from El Tucsonense. The word "Tucson" shows up 2,066 times!

If you take the time to look at digitized copies of these papers, you might notice other terms like the street address where the paper is published or similar information that shows up consistently in every issue. You may choose to expand your list of stop words before you begin analysis, and you may also make adjustments after your initial analysis turns up surprising results. This process, like much of the work of text analysis, may mean alternating between close readings of sample texts from the collection and interpretation of computational outputs over several iterations.
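One way to picture this iterative adjustment is to keep the general-purpose list and the corpus-specific additions as separate sets and combine them before counting. The sketch below assumes that approach; both sets are shortened stand-ins, the custom terms are hypothetical choices, and the word list is invented rather than drawn from a real issue.

```python
from collections import Counter

# Shortened stand-in for a general-purpose stop word list.
base_stop_words = {"the", "of", "and", "in", "a"}

# Corpus-specific additions (hypothetical choices for a paper titled the Tribune).
custom_stop_words = {"tribune", "phoenix"}

# Set union: every word that appears in either list.
stop_words = base_stop_words | custom_stop_words

# Invented word list standing in for the tokenized text of several issues.
words = ["the", "tribune", "influenza", "in", "phoenix",
         "influenza", "weather", "tribune"]

# Count only the words that survive the combined stop word list.
counts = Counter(word for word in words if word not in stop_words)
print(counts.most_common())
```

Keeping the custom terms in their own set makes it easy to revise them after a first pass through the results without touching the general list.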

Putting It Together: Counting Words in Voyant

If you would like to try this out for yourself, you can! This is completely optional, but if you're curious:

  • Go to Voyant: https://voyant-tools.org/

  • Open one of the pre-loaded collections (e.g., the plays of William Shakespeare or the novels of Jane Austen)

  • Look at the first module in the upper left corner

  • Toggle between the "Cirrus" option and the "Terms" option to compare the usefulness of word clouds versus frequency tables

  • See if you can find and edit the list of stop words that Voyant uses (warning: it's a little tricky and you may need to click around a bit)

  • Consider how you might want to use word frequencies for your own projects

Case Method 2: Relative Word Frequencies

A further layer of complexity can be added by analyzing the frequency of selected words over a period of time. The image here tracks trends for two selected words in the Bisbee Review for the entirety of 1917, a year notable for events associated with the Bisbee Deportation, in which striking mine workers were illegally kidnapped and deported. Here we see the word "strike" (along with variants such as "strikers" and "strikebreaker") in blue and the word "deport" (along with variants such as "deported" and "deportation") in green. These are charted as a relative frequency (y-axis) for each issue of the paper (x-axis).

The counting here becomes a bit more complicated because for each issue we want to know how frequently our selected words appear relative to the total number of words in the issue. To establish the relative frequency we divide the number of times the selected term appears by the total number of words in the issue. Relative frequencies allow us to account for variations in length from issue to issue, and this proportional approach helps us to compare frequencies across issues. This is why we see the relative frequency represented as a number between 0 and 1.
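The division described above can be sketched as a short Python function. Matching variants by checking whether a word starts with the selected term is an assumption made for this example (it is a crude stand-in, and the chart may group variants differently); the function name and both word lists are invented.

```python
def relative_frequency(words, prefix):
    """Share of `words` that begin with `prefix`, as a number between 0 and 1."""
    if not words:
        return 0.0  # avoid dividing by zero for an empty issue
    # Count every word that starts with the selected term, so that
    # "strikers" is counted along with "strike".
    matches = sum(1 for word in words if word.startswith(prefix))
    return matches / len(words)

# Invented word lists standing in for two issues of different lengths.
short_issue = ["strike", "called", "in", "bisbee"]
long_issue = ["strikers", "meet", "as", "strike", "continues", "in", "bisbee", "today"]

print(relative_frequency(short_issue, "strike"))  # 1 match out of 4 words
print(relative_frequency(long_issue, "strike"))   # 2 matches out of 8 words
```

The raw counts differ (one match versus two), but both issues devote the same share of their words to the term, which is the kind of comparison a relative frequency makes possible.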

In the example provided here, we see a profound increase in discussions of the strike beginning on June 28, 1917 (coded here as 19170628 in a year-month-day format) with discussions of deportation beginning around July 6, peaking on July 13, and continuing at an elevated rate through the summer and into the fall. You can explore a more interactive version of this chart here.
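Date codes in a year-month-day format such as 19170628 can be turned into proper dates with Python's standard `datetime` module, which makes it straightforward to put issues in chronological order. In this sketch the function name is hypothetical and the frequency values are invented placeholders, not readings from the chart.

```python
from datetime import datetime

def parse_issue_date(code):
    """Convert a year-month-day code such as "19170628" into a date object."""
    return datetime.strptime(code, "%Y%m%d").date()

# Placeholder relative frequencies keyed by issue code (invented values).
frequencies_by_issue = {"19170713": 0.004, "19170628": 0.003, "19170706": 0.002}

# Sort the issue codes chronologically before printing or plotting them.
ordered_codes = sorted(frequencies_by_issue, key=parse_issue_date)
for code in ordered_codes:
    print(parse_issue_date(code).isoformat(), frequencies_by_issue[code])
```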

Common Tools for Analyzing Relative Frequencies over Time

There are a variety of useful web-based tools for analyzing word frequencies over time, and the following are freely available online if you would like to explore the concept further:

  • The Google NGram Viewer reveals trends across the entire corpus of the Google Books digitization initiative. You can search for phrases up to five words long, adjust the time period for your analysis, and subset the corpus based on language. A set of advanced search features also allows for more sophisticated distinctions, such as indicating whether you are interested in the search term "mine" as a noun, a verb, or a pronoun.
  • HathiTrust + Bookworm is a similar tool developed to explore trends across the entire collection of HathiTrust's digitized texts. You can use metadata from the HathiTrust catalog records to filter and refine your search, and by clicking on the trend line you can drill down in your results to a list of the volumes from each year that contain your selected term.
  • Voyant Tools packages multiple text analysis components together in a single interface and includes relative word frequency over time. You can choose from two pre-curated collections or upload your own text corpora for analysis, though there are comparatively fewer options for fine-tuning your parameters.

Each of these tools is excellent for conducting exploratory analysis, but no one tool exactly matches our local use case for analyzing borderlands newspapers. Researchers using this site are conducting analyses of the same basic dataset, each with a different set of interests and disciplinary perspectives. To accommodate the needs of multiple users, we want to be able to import texts for analysis without asking people to download datasets or install specialized software, and we also want to process our texts with an eye toward the idiosyncrasies of newspaper data. We want you to be able to:

  • Filter your analysis by newspaper title and date range,

  • Choose whether to conduct page-level or volume-level analysis,

  • Conduct comparative analysis across two different languages, and

  • Gain useful data literacy skills along the way!

Welcome to Jupyter Notebooks!

Jupyter Notebooks is a web application for interactive computing. We like it because it helps us control and contextualize the text mining experience so that all users meet the same basic learning objectives while also giving them a path forward to further experimentation. You will be using an online 'notebook' created in Jupyter Notebooks for your investigation. Each notebook includes pre-set modules of computer code alongside explanations and descriptions in plain language to help describe each step in the process of text analysis. You can run and edit the code directly -- even without any prior programming experience. By using Jupyter Notebooks, you get a little more exposure to what's happening behind the scenes and how text mining works in practice. In another part of the Border Hub, you have access to a Jupyter Notebook to analyze the newspapers included in this website.

Optional Further Readings

William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), https://programminghistorian.org/en/lessons/counting-frequencies.

Quinn Dombrowski, Tassie Gniady, and David Kloster, "Introduction to Jupyter Notebooks," The Programming Historian 8 (2019), https://programminghistorian.org/en/lessons/jupyter-notebooks.
