OCR Data

What is OCR?

Optical character recognition (OCR) is a fully automated process that converts the visual image of numbers and letters into computer-readable numbers and letters. Computer software can then search the OCR-generated text for words, phrases, numbers, or other characters. However, OCR is not 100 percent accurate. Particularly when the original item has extraneous markings on the page, unusual text styles, or very small fonts, the searchable text that OCR generates will contain errors that cannot be corrected by automated means.

Although errors in the process are unavoidable, OCR is still a powerful tool for making text-based items accessible to searching. For example, important concept words often appear more than once within an article. Therefore, if OCR misreads one instance of a key word in a passage, but correctly reads the second instance, the passage will still be found in a full-text search.
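The point about repeated key words can be illustrated with a minimal Python sketch. The passage text and the function name passage_matches are hypothetical and are not part of the collection's search software; the sketch only shows why one correctly read instance is enough for a simple substring search to find the passage.

```python
def passage_matches(ocr_text: str, term: str) -> bool:
    """Return True if the search term occurs at least once in the OCR text."""
    return term.lower() in ocr_text.lower()


# Hypothetical OCR output: the first "cotton" was misread as "c0tton",
# the second instance was read correctly.
passage = "The c0tton crop was poor this year. Prices for cotton rose sharply."

# The misread instance does not match, but the correct one does,
# so the passage as a whole is still found.
print(passage_matches(passage, "cotton"))
```

A passage in which every instance of the word was misread would not be found by this kind of exact search.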

To enable research and external services, the Texas A&M University Newspaper Collection provides bulk access to its OCR data. The table below lists the data files available for download. Each file decompresses into a directory structure that lets you easily map each OCR file to the URL identifier for its page. For example, a file such as sn86088544/1893-10-01/ed-1/seq-1/ocr.txt maps to the URL http://newspaper.library.tamu.edu/lccn/sn86088544/1893-10-01/ed-1/seq-1/.
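The path-to-URL mapping described above could be sketched in Python as follows. The function name ocr_path_to_url is hypothetical, and the sketch assumes that the four directory levels directly above the OCR file are always the LCCN, date, edition, and sequence, as in the example path.

```python
from pathlib import PurePosixPath

# Base URL taken from the example above.
BASE_URL = "http://newspaper.library.tamu.edu/lccn/"


def ocr_path_to_url(ocr_path: str, base_url: str = BASE_URL) -> str:
    """Map an extracted OCR file path to the URL of the page it belongs to."""
    # Drop the file name and keep the directory components.
    parts = PurePosixPath(ocr_path).parent.parts
    if len(parts) < 4:
        raise ValueError(f"expected lccn/date/edition/sequence in: {ocr_path}")
    # Use only the last four levels, so a leading extraction directory is ignored.
    return base_url + "/".join(parts[-4:]) + "/"


print(ocr_path_to_url("sn86088544/1893-10-01/ed-1/seq-1/ocr.txt"))
```

Because only the last four directory levels are used, the same function also works when the archive has been extracted into a subdirectory.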

If you are interested in automated access to this data, you may want to use the Atom and JSON versions of this table.

Filename                      Batch                 Created                    Size       SHA-1 Checksum
batch_txa_batt_ver58.tar.bz2  batch_txa_batt_ver58  2017-07-11T10:33:00-05:00  165 bytes  3426a078d57f1b8edd2f34f7413e0195
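
A downloaded file can be checked against the checksum listed in the table. The following Python sketch is one possible way to do that, not the collection's own tooling. The function names are hypothetical. Note that the column is labeled SHA-1, but the listed value is 32 hexadecimal characters long, which is the length of an MD5 digest (a SHA-1 digest has 40), so the sketch chooses the algorithm from the length of the listed value; that choice is an assumption.

```python
import hashlib

# Assumption: the listed checksum is either an MD5 digest (32 hex characters)
# or a SHA-1 digest (40 hex characters).
_ALGORITHM_BY_HEX_LENGTH = {32: "md5", 40: "sha1"}


def file_digest(path: str, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file's digest with the checksum listed for it."""
    listed = listed.strip().lower()
    algorithm = _ALGORITHM_BY_HEX_LENGTH.get(len(listed))
    if algorithm is None:
        raise ValueError(f"unexpected checksum length: {len(listed)}")
    return file_digest(path, algorithm) == listed
```

For the file in the table, the call would be matches_listed_checksum("batch_txa_batt_ver58.tar.bz2", "3426a078d57f1b8edd2f34f7413e0195"), assuming the archive has been saved under its listed name in the current directory.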