Andrew W. Mellon Foundation Funds Northeastern’s NULab to Study OCR for the Humanities

Northeastern University’s NULab for Texts, Maps, and Networks announces a grant from the Andrew W. Mellon Foundation to David Smith (Computer Science) and Ryan Cordell (English) to study the current state of optical character recognition (OCR) for historical and multilingual documents and to outline future directions for research in this area.

Scholars and students of history, literature, and the social sciences now commonly rely on access to billions of scanned pages from Google Books, the Library of Congress, the Internet Archive, and other commercial or academic resources. Whether acknowledged or not, this means that scholars increasingly base research conclusions on automatic OCR transcriptions of page images. At the same time, recent industrial and academic research on OCR has narrowed the gap between systems for recognizing clean print in the Roman alphabet, connected writing systems such as Arabic or Devanagari, and manuscript text. Research in these fields, however, has had little overlap with the diverse problems of humanities data, such as historical languages, media, and scripts.

Over the next year, NULab researchers will survey the state of the art in developing and adapting OCR for historical and multilingual documents, as well as in using this OCR output for humanistic research. By surveying experts across the academy, industry, and libraries; collecting benchmark datasets; and holding a workshop at Northeastern, the team will thoroughly document current practices and outline a common agenda for future research in OCR as it relates to the humanities.