Transkribus and IIIF: beneficial possibilities between image sharing and Handwritten Text Recognition frameworks

Florian Krull - University of Innsbruck (Austria), Guenter Muehlberger - University of Innsbruck (Austria), Melissa Terras - University of Edinburgh (UK)

Presentation type: Presentation

Abstract:

The Transkribus platform enables scholars to train computer models to automatically recognize handwritten and printed documents. There is no restriction regarding the age, style, script or language of the documents. Codices from the Middle Ages can be processed in the same way as modern manuscripts, and languages such as ancient Greek, Latin or Hebrew pose no greater challenge than modern English, German or Arabic. Transkribus is designed as an open platform to support collaboration among scholars, computer scientists, archives, libraries and the public. The assumption is that the more data the platform holds, the greater the benefit for all users. In February 2019 more than 20,000 users were registered on the platform, hundreds of them actively transcribing documents. More than seventy institutions, such as universities, research groups, archives and libraries, have concluded collaboration agreements indicating their interest in the platform and the technology. In January 2019 more than 220 text recognition models were successfully trained by Transkribus users, indicating their diverse interests and competences. To sustain the service platform, several universities and institutions around Europe will set up a European Cooperative Society. This legal entity will enable the team to develop the platform further and offer services on a new scale. See Mühlberger et al. (forthcoming 2019) for an overview.

One of the new services will be to explore options for connecting the IIIF technology and community with the Transkribus platform. Several scenarios are conceivable. First, a simple “upload” mechanism for transferring images from a IIIF server to the Transkribus platform would enable libraries and archives to enrich their holdings with textual data. Simply delivering the ID of the manifest would start a text recognition job at Transkribus. In return, an updated manifest could either be delivered to the hosting institution or be made available via a IIIF server based within the Transkribus environment. Second, transcription interfaces, which are increasingly being developed at several institutions on the basis of IIIF technology, may be connected with the Transkribus API, not only to request a recognition job but also to use the transcribed texts as training material for machine learning processes. Other options include making all images within Transkribus available via IIIF (users currently upload around 5000 images per day), or even moving specific features, such as training or evaluation functionalities, into a IIIF-based web interface.
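To make the first scenario more concrete, the following is a minimal sketch of the two manifest-handling steps it implies: collecting the page images a recognition job would need, and returning an updated manifest that points to the recognized text. It assumes a IIIF Presentation API 2.x manifest that has already been fetched and parsed into a dictionary. The function names, the example.org URLs and the annotation list URL template are hypothetical illustrations; nothing here reflects the actual Transkribus API, and the Image API size value "full" is likewise an assumption that some servers may want replaced with "max".

```python
import copy
from typing import Any, Dict, List


def image_urls_from_manifest(manifest: Dict[str, Any], size: str = "full") -> List[str]:
    """Collect one image URL per canvas from a IIIF Presentation 2.x manifest.

    Hypothetical helper: prefers the IIIF Image API service of the first image
    on each canvas and falls back to the plain resource URL.
    """
    urls: List[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            # Only the first image annotation of each canvas is considered.
            for annotation in canvas.get("images", [])[:1]:
                resource = annotation.get("resource", {})
                service = resource.get("service")
                if isinstance(service, dict) and "@id" in service:
                    base = service["@id"].rstrip("/")
                    # Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
                    urls.append(f"{base}/full/{size}/0/default.jpg")
                elif "@id" in resource:
                    urls.append(resource["@id"])
    return urls


def with_annotation_lists(manifest: Dict[str, Any], list_url_template: str) -> Dict[str, Any]:
    """Return a copy of the manifest whose canvases reference text annotation lists.

    `list_url_template` is an assumed URL pattern containing `{index}`; where the
    recognized text would actually be served is left open in the abstract.
    """
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    for sequence in updated.get("sequences", []):
        for index, canvas in enumerate(sequence.get("canvases", [])):
            canvas.setdefault("otherContent", []).append(
                {"@id": list_url_template.format(index=index), "@type": "sc:AnnotationList"}
            )
    return updated


# A tiny, made-up manifest used only to exercise the two helpers.
SAMPLE_MANIFEST: Dict[str, Any] = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/book1/manifest",
    "@type": "sc:Manifest",
    "sequences": [{
        "canvases": [
            {"@id": "https://example.org/iiif/book1/canvas/1",
             "images": [{"resource": {
                 "@id": "https://example.org/iiif/page1/full/full/0/default.jpg",
                 "service": {"@id": "https://example.org/iiif/page1"}}}]},
            {"@id": "https://example.org/iiif/book1/canvas/2",
             "images": [{"resource": {"@id": "https://example.org/images/page2.jpg"}}]},
        ]
    }],
}

if __name__ == "__main__":
    for url in image_urls_from_manifest(SAMPLE_MANIFEST):
        print(url)
    enriched = with_annotation_lists(SAMPLE_MANIFEST, "https://example.org/htr/list/{index}.json")
    print(enriched["sequences"][0]["canvases"][0]["otherContent"])
```

Working on a deep copy keeps the hosting institution's original manifest unchanged, which fits either delivery route mentioned above: the enriched copy could be handed back or served separately.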

The presentation will discuss these options and invite the audience to contribute their own ideas and suggestions as to how the two communities and frameworks can work together.

Muehlberger, G. et al.: Transforming Scholarship in the Archives Through Handwritten Text Recognition: Transkribus as a Case Study. Journal of Documentation (in press).

Topics:

  • Annotation, including full-text or academic use cases,
  • Using IIIF material for Machine Learning and AI

Keywords:

  • OCR,
  • HTR,
  • text recognition,