Optical Character Recognition Confidence Count (OCRCC)

Status:
The Access Services Department of the Howard W. Hunter Law Library is close to completing the first test set of more than 300 records. We still need to consult linguists to help establish the standards for any proposed method.

Goal:
Develop a Confidence Score for OCR’d text within the BYU Law Corpora

Background:
With the exception of Text Coding Project text that has been reviewed by humans, the majority of text sources have relied on some form of optical character recognition to produce the underlying text used in the corpora. The accuracy of that text can depend on variables such as the quality of the image, the font used in the original, the spelling that was accepted when the original text was created, and the OCR algorithm used when the images were processed. Developing metadata that attempts to represent the accuracy of the text as a value may allow researchers to control for the relative accuracy of the materials used for analysis.
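
As a rough illustration only, the idea of attaching an accuracy value to each record and letting researchers filter on it could be sketched as follows. No scoring method has been established for this project, so everything here is an assumption: the function names, the record layout, the word list, and the choice of "fraction of tokens found in a reference word list" as the score are all hypothetical.

```python
import re

def confidence_score(text, known_words):
    """Hypothetical score: fraction of alphabetic tokens found in known_words.

    Returns a value from 0.0 to 1.0. This is a stand-in measure,
    not the project's method.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if token in known_words)
    return recognized / len(tokens)

def filter_by_confidence(records, known_words, threshold):
    """Keep only the records whose score meets or exceeds the threshold."""
    return [
        record for record in records
        if confidence_score(record["text"], known_words) >= threshold
    ]

# Illustrative data only; not drawn from the BYU Law Corpora.
known_words = {"the", "court", "held", "that", "statute", "applies"}
records = [
    {"id": "a", "text": "The court held that the statute applies."},
    {"id": "b", "text": "Tbe c0urt hcld tbat thc statutc applics."},
]

kept = filter_by_confidence(records, known_words, threshold=0.9)
print([record["id"] for record in kept])
```

A word-list measure like this would count period spellings that were accepted when the original text was created as errors unless the reference list includes them, which is one reason the standards for any real method still need input from linguists.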