The Universitat Politècnica de València (UPV), through its Pattern Recognition and Human Language Technologies (PRHLT) research center, is a partner in READ, a European project that aims to develop advanced tools for automatically transcribing and indexing historical manuscripts. The project is funded by the EU's Horizon 2020 programme and will last three and a half years.
The project will give researchers access to transcripts of documents dating from the 14th century to the present, including Lope de Vega’s manuscripts held by the Biblioteca Nacional de España, letters from the Brothers Grimm held by the Marburg State Archives, and a vast number of documents on the history of Venice gathered over hundreds of years.
Documents of great value
"Those are probably the most remarkable documents, but we also aim to make a trove of civil documents, such as marriage, birth and death records, court decisions, etc., available to researchers, historians, linguists, genealogists and the general public, since these documents have great value for demographic and genealogical studies," explains Joan Andreu Sánchez, a researcher at the PRHLT center at the UPV.
The project works with documents from countries such as Spain, Italy, Germany, the United Kingdom, the Netherlands and Finland. It will also make it possible to transcribe original manuscripts written in Latin, German, Dutch, English, Castilian Spanish, Italian and Finnish.
"The idea is that, in the future, libraries and archives will be able to provide access to this content so that people can look inside the documents, not just at the metadata, as is done today," Joan Andreu Sánchez says.
Automatic learning
According to the PRHLT researcher, one of the problems with historical documents is the lack of standard writing rules and editing conventions, which makes them highly variable. Because individual characters cannot be automatically isolated, these documents cannot be transcribed with conventional optical character recognition (OCR) techniques. Recognition must instead rely on holistic techniques that recognize characters, words and sentences "as a whole".
"There are documents with annotations in the margins, words added between lines, crossed-out words, texts with many abbreviations, high variability in the type of writing, etc. The project seeks to process all this heterogeneity and make this information accessible by transcribing or indexing it using new tools," Joan Andreu Sánchez explains.
READ partners are therefore working on new Handwritten Text Recognition (HTR) solutions that will be incorporated into Transkribus, open-source software developed within another European project called Transcriptorium.
A step further
"READ picks up the baton of this project and takes it a step further. In Transcriptorium, we advanced HTR technology and released the content to providers, that is, archives and libraries. In READ, our aim is to expand the use of HTR technology to large-scale scenarios and provide services to major content providers," Joan Andreu Sánchez explains. The UPV’s work within READ focuses on the Transkribus recognition and indexing module.
The key to the tools used by READ researchers lies in their ability to build models that learn automatically from samples. These models need a relatively small amount of training data to achieve very satisfactory results. "Once the models are learned, highly efficient transcription techniques based on finite-state models are used. A significant aspect of this process is the use of language models that use context to restrict the transcription search process," Joan Andreu Sánchez explains.
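How a language model can use context to narrow the transcription search can be illustrated with a toy sketch. This is not the READ or Transkribus implementation: the candidate words, their visual scores and the bigram probabilities below are all invented for illustration, and a plain Viterbi search over a bigram model stands in for the finite-state decoding the researcher mentions.

```python
import math

# Hypothetical recognizer output: for each word position, candidate readings
# with a visual (optical) probability. All numbers are invented.
CANDIDATES = [
    {"the": 0.6, "she": 0.4},
    {"clerk": 0.45, "clock": 0.55},
    {"signed": 0.7, "sighed": 0.3},
]

# Hypothetical bigram language model: P(word | previous word).
BIGRAMS = {
    ("<s>", "the"): 0.6, ("<s>", "she"): 0.4,
    ("the", "clerk"): 0.5, ("the", "clock"): 0.5,
    ("she", "clerk"): 0.01, ("she", "clock"): 0.01,
    ("clerk", "signed"): 0.6, ("clerk", "sighed"): 0.2,
    ("clock", "signed"): 0.001, ("clock", "sighed"): 0.001,
}
FLOOR = 1e-6  # probability given to word pairs the model has never seen


def decode(candidates, bigrams, use_lm=True):
    """Return the best-scoring word sequence using a Viterbi search.

    With use_lm=False only the visual scores count, so each position is
    effectively decided on its own.
    """
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for position in candidates:
        new_best = {}
        for word, visual_p in position.items():
            options = []
            for prev, (score, path) in best.items():
                lm_p = bigrams.get((prev, word), FLOOR) if use_lm else 1.0
                total = score + math.log(visual_p) + math.log(lm_p)
                options.append((total, path + [word]))
            new_best[word] = max(options, key=lambda o: o[0])
        best = new_best
    return max(best.values(), key=lambda o: o[0])[1]


if __name__ == "__main__":
    print(decode(CANDIDATES, BIGRAMS, use_lm=False))  # visual scores only
    print(decode(CANDIDATES, BIGRAMS))                # visual scores + context
```

In this made-up example the visual scores alone favour "clock" in the second position, while the bigram context makes "the clerk signed" the overall best reading. Real systems work with far richer models, but the principle of letting context prune unlikely hypotheses is the same.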
The tools also allow users to edit and correct possible mistakes in the automatic transcript using interactive techniques. One application of the techniques developed in READ will make it possible to index large collections of documents without first producing a transcription of them.
On demand transcription
In the future, users will also be able to upload a collection of images and ask the system to provide them with a transcript. "This service, which will be available through Transkribus, will be free for users in a standard service chart, while they can seek ad hoc solutions for more complex problems," Joan Andreu Sánchez explains.
The READ project began last January and will run until June 2019.