Northeastern University announces a grant from the Andrew W. Mellon Foundation to study the current state of optical character recognition (OCR) for historical and multilingual documents and to outline future directions for research in this area. Scholars and students of history, literature, and the social sciences now commonly rely on access to billions of scanned pages from Google Books, the Library of Congress, the Internet Archive, and other commercial or academic resources. Whether acknowledged or not, this means that scholars increasingly base their research on automatic OCR transcriptions of page images. At the same time, recent industrial and academic OCR systems have narrowed the accuracy gap among transcribing printed text in the Roman alphabet, connected writing systems such as Arabic or Devanagari, and handwriting. Research in these fields, however, has had little overlap with the diverse problems of humanities data, such as historical languages, scripts, and document layouts.
Over the next year, David Smith (Computer Science), Ryan Cordell (English), and other researchers at Northeastern’s NULab for Texts, Maps, and Networks will survey the state of the art in developing and adapting OCR for historical and multilingual documents, as well as in using this OCR output for humanistic research. By surveying experts across the academy, industry, and libraries; collecting benchmark datasets; and holding a workshop at Northeastern, the team will document current practices and outline an agenda for OCR research as it relates to the humanities.
This project, which emerged from collaborative conversations among the Mellon Foundation, the National Endowment for the Humanities, and the Library of Congress, is intended to serve as a catalyst for further OCR research and to help the community advance best practices for digital collections that use OCR data.