Newspapers as Data: Context

Collage of front pages of historical Arizona newspapers

The goal of this lesson is to understand specific factors to consider when using this newspaper data set in your research, including:

  • Terminology used in historical newspapers
  • Newspaper's audience, ownership, and editorial stance
  • Broader context of newspaper publishing in Arizona
  • Scope of the newspapers
  • What Optical Character Recognition (OCR) is

Terminology used in historical newspapers

When doing any newspaper research, it's important to keep in mind that terminology has changed over time. People may have used different words than we use today, or terms may have had different meanings than they do now. For example, "World War I" was not called by that name while the war was happening. Historical newspapers also reflect the language and attitudes of their time and may contain sensitive, offensive, racist, and outdated terms and images.


Newspaper's audience, ownership, and editorial stance

Information that can help provide context when using newspapers for research includes:

  • Was a newspaper intended for a broad audience or a more focused readership? For example, The Apache Sentinel was published by and for the soldiers at Fort Huachuca.
  • Who owned the newspaper? Who were its editors?
  • What was the newspaper's editorial stance or political leaning, and did it change over time?
  • What was the economic interest of the newspaper? For example, The Bisbee Daily Review was owned for a time by the Phelps-Dodge mining company.

Some of this information can be found in the newspaper itself, such as in a slogan or subheading under the newspaper's title, or on the editorial page where the owners' and editors' names are listed.

Essays provided in Chronicling America also have information about the audience, ownership, editorial stance, and history of each newspaper.

 

Newspapers in Arizona

When using newspapers in your research, it's important to consider the broader context of newspaper publishing. The eight newspapers in this data set each cover a particular period of time, and together they are only a subset of the newspapers published in Arizona.

For context, according to the Ayer Directory, which was an annual directory listing newspapers published in the United States, there were 68 newspapers and periodicals published in Arizona in 1915. This data set includes only 3 titles that were published in 1915: the Border Vidette, The Bisbee Daily Review, and El Tucsonense. 

Similarly, in 1959, there were over 100 newspapers published in Arizona, and this data set includes 4 that were published in that year: El Tucsonense, El Sol, Arizona Sun, and Arizona Post.

Scope of the newspapers

Another consideration is that the newspapers were published at different frequencies and for different lengths of time. For example, El Tucsonense was published for over 40 years, usually twice a week, whereas the Phoenix Tribune was published only once a week for about 13 years and, in its final years, appeared only about once a month.

Newspaper frequency and duration of publication

The newspapers also varied in length. Some were only 4 pages, some grew to be about 8 pages, and some were longer.

The eight newspapers described in the Newspapers as Data: Descriptions and Locations lesson represent roughly 90,000 pages. For context, the whole of Chronicling America is over 16 million pages.
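To put these figures side by side, the following is a minimal sketch of the arithmetic using only the numbers quoted in this lesson. The helper name share is made up for illustration. Because the lesson gives "over 100" newspapers for 1959, "roughly 90,000" pages, and "over 16 million" pages, the last two results are approximations rather than exact values.

```python
def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# 3 of the 68 Arizona titles listed in the Ayer Directory for 1915
titles_1915 = share(3, 68)

# 4 of "over 100" Arizona titles in 1959; 100 is used as the lower bound
# of the total, so this percentage is an upper bound
titles_1959 = share(4, 100)

# Roughly 90,000 pages out of the 16 million or more in Chronicling America
pages = share(90_000, 16_000_000)

print(f"1915 titles: {titles_1915}%")
print(f"1959 titles: at most {titles_1959}%")
print(f"Pages: about {pages}%")
```

In each case the data set covers under 5 percent of what was published or digitized, which is worth keeping in mind before generalizing from it.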


In summary, keep this context in mind as you do your research and text mining. Knowing what is included, what is not included, and what the limitations of the content might be will help you analyze the newspaper data and draw sound inferences from it:

  • Number of newspapers in this project compared to total number published in Arizona
  • Varied frequency and duration of publication
  • Varied length of newspaper issues
  • Total number of pages in this project compared to the total number in Chronicling America

It's also important to consider the audience, ownership, and editorial stance of a newspaper, as well as its historical context and historical terminology, when conducting newspaper research.


Optical Character Recognition (OCR) and Newspapers

Another aspect of newspaper research that is important to understand is the OCR text that makes it possible to search digitized newspapers and use them for text mining.

What is OCR?

“Optical character recognition (OCR) is a fully automated process that converts the visual image of numbers and letters into computer-readable numbers and letters. Computer software can then search the OCR-generated text for words, phrases, numbers, or other characters. However, OCR is not 100 percent accurate...Although errors in the process are unavoidable, OCR is still a powerful tool for making text-based items accessible to searching.” ( Library of Congress )

The process of OCR allows a computer to turn a visual image – such as a page from a newspaper that has been scanned – into readable text. While OCR is not 100% accurate, it still enables useful searching of texts. It is what enables you to search through the full text of a document, like a newspaper.

Here is an example of a newspaper article from The Border Vidette next to the corresponding OCR text.

Most of the OCR text is correct, but there are a few errors.

Here is another example from the front page of El Sol. Again, most of the OCR text is correct, but there are some noticeable errors, such as the misspelling of California ("Calirornia") and the name Miguel ("Migue"), as well as the word periódicos being split into two words because of the line break where it was hyphenated as periódi- cos.
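The split caused by end-of-line hyphenation can sometimes be repaired before text mining. The following is a minimal sketch, not the method used by Chronicling America: the function name join_hyphenated and the regular expression are assumptions made for illustration. It joins a word fragment that ends in a hyphen with the fragment that follows the whitespace or line break.

```python
import re

# A word fragment, a hyphen, whitespace (a space or a line break),
# and then the rest of the word.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s+(\w+)")

def join_hyphenated(text):
    """Rejoin words that were hyphenated across a line break."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)

print(join_hyphenated("los periódi- cos"))
```

This approach has a cost. A genuine hyphenated compound that happens to fall at a line break would also be joined into one word, so it is worth checking a sample of the results by eye. Hyphens with no whitespace after them, as in "well-known", are left alone.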

While OCR errors may seem significant, words are often repeated within a newspaper story. A word that is misspelled in one place may be correct in another, so a search or text mining can still find it.
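As a small illustration of this point, the sketch below counts words in a short passage. The sample text is invented for this example and is not taken from any of the newspapers. It borrows the "Calirornia" error mentioned above, and the function name word_counts is hypothetical.

```python
import re
from collections import Counter

def word_counts(text):
    """Count each lowercased word in the text."""
    return Counter(re.findall(r"\w+", text.lower()))

# Invented sample text with one OCR-style misspelling of "California"
sample_text = (
    "Growers in Calirornia met on Monday. "
    "The California delegation arrived later. "
    "Readers in California wrote to the editor."
)

counts = word_counts(sample_text)
print("california:", counts["california"])
print("calirornia:", counts["calirornia"])
```

The correct spelling is still found twice even though one occurrence was garbled. The count is lower than the true number of mentions, so word frequencies drawn from OCR text are best read as estimates.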

In the Intro to Text Mining lesson, you will learn more about text mining and counting word frequencies.
