Beyond Madoc: Development of a IIIF-based digital corpus management and text analysis platform

Joke Daems - Ghent Centre for Digital Humanities (Belgium), Sally Chambers - Ghent Centre for Digital Humanities (Belgium), Christophe Verbruggen - Ghent Centre for Digital Humanities (Belgium)

Presentation type: Lightning talk

Abstract:

Researchers wanting to perform digital text analysis first need access to texts. While this is relatively straightforward for researchers working on born-digital text, those working on historical text must go through several steps before they can begin the text analysis itself (for example, topic modelling or social network analysis). Key steps in this process are finding, collecting, and combining relevant data from a variety of digital resources, enriching those resources with annotations, and extracting the full text from them. It is important that the link between the extracted text and the original is maintained, ideally at the line, word, or even character level, so that researchers can easily move from a 'distant reading' to a 'close reading' perspective.

Many textual resources already exist in image form (scans and photographs of manuscripts, for example). The strength of IIIF is that it allows such images to be easily exchanged and combined across different locations. As part of the Flemish participation in DARIAH (Digital Research Infrastructure for the Arts and Humanities) and CLARIN (Common Language Resources and Technology Infrastructure), the Ghent Centre for Digital Humanities supported the development of Madoc by Digirati. Madoc is an Omeka S-based platform for the display, enrichment, and curation of IIIF-based digital objects. It allows the import and presentation of IIIF manifests as well as crowd-sourced image annotations. It is a key first step towards a platform that will allow researchers to collect and curate resources and extract the text layer for further analysis. What is still missing is that second step: extracting the text.
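To illustrate what importing a manifest involves, the sketch below walks a minimal IIIF Presentation 3 manifest and lists, for each canvas, the image painted onto it. The manifest, its URLs, and the `list_images` helper are hypothetical examples for illustration; a real platform would fetch the JSON from a IIIF server rather than defining it inline.

```python
# A minimal, hypothetical IIIF Presentation 3 manifest: one canvas with
# one "painting" annotation attaching an image to it. All URLs are
# placeholders; a real manifest would be fetched from a IIIF endpoint.
manifest = {
    "id": "https://example.org/iiif/book1/manifest",
    "type": "Manifest",
    "label": {"en": ["Example manuscript"]},
    "items": [
        {
            "id": "https://example.org/iiif/book1/canvas/p1",
            "type": "Canvas",
            "label": {"en": ["folio 1r"]},
            "items": [
                {
                    "type": "AnnotationPage",
                    "items": [
                        {
                            "type": "Annotation",
                            "motivation": "painting",
                            "body": {
                                "id": "https://example.org/iiif/book1/p1/full/max/0/default.jpg",
                                "type": "Image",
                            },
                            "target": "https://example.org/iiif/book1/canvas/p1",
                        }
                    ],
                }
            ],
        }
    ],
}


def list_images(manifest):
    """Return (canvas label, image URL) pairs for each painting annotation."""
    pairs = []
    for canvas in manifest.get("items", []):
        # Labels are language maps; flatten all values into one string.
        label = ", ".join(
            v for values in canvas.get("label", {}).values() for v in values
        )
        for page in canvas.get("items", []):
            for anno in page.get("items", []):
                if anno.get("motivation") == "painting":
                    pairs.append((label, anno["body"]["id"]))
    return pairs


print(list_images(manifest))
```

Because manifests from any IIIF-compliant institution share this structure, the same traversal works regardless of where the images are hosted, which is what makes combining resources across collections straightforward.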

In the next phase, our goal is therefore to extract the text layer from the ingested images, either automatically through integration with Optical Character Recognition (OCR) software, or manually by adding transcription functionality that is more user-friendly than the current options and that maintains the direct link between words and sentences and their location on the image. The idea is to build on the image-centred focus of IIIF and expand it towards textual data. A further envisioned enhancement to the platform is integrated search, allowing researchers to query metadata, annotations, and full text simultaneously.
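One way to keep that word-to-image link, sketched below, is to store each extracted word as a W3C Web Annotation (the model IIIF uses) whose target pins the text to a canvas region via an `xywh` media fragment. The `word_annotation` helper, the canvas URI, and the id scheme are assumptions for illustration, not the platform's actual implementation.

```python
# Minimal sketch: one OCR'd or transcribed word represented as a
# "supplementing" Web Annotation. The full text can be rebuilt by
# concatenating the bodies, while each word keeps its coordinates on
# the image in the target's xywh fragment.
def word_annotation(canvas_uri, word, x, y, w, h, idx):
    # canvas_uri, idx, and the id pattern are hypothetical placeholders.
    return {
        "id": f"{canvas_uri}/annos/word-{idx}",
        "type": "Annotation",
        "motivation": "supplementing",
        "body": {"type": "TextualBody", "value": word, "format": "text/plain"},
        "target": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }


anno = word_annotation("https://example.org/canvas/p1", "Gent", 120, 88, 64, 22, 0)
print(anno["target"])
```

With annotations shaped like this, moving from a 'distant reading' view (the concatenated full text) back to a 'close reading' view (the exact spot on the scan) is just a matter of resolving the target fragment.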

We are currently conducting a market study for the next phase of development, which should take place in the coming months. During the IIIF conference, we would like to present the Madoc platform (and any additional developments that have taken place by then) from a researcher's perspective in a lightning talk: how can it help a researcher go from data collections in libraries to digital text analysis? And what are the current bottlenecks in integrating IIIF and text?

Topics:

  • Annotation, including full-text or academic use cases
  • IIIF Implementation Spectrum: large-scale or small-scale projects

Keywords:

  • digital text analysis
  • corpus management
  • DARIAH
  • CLARIN
  • digital humanities