Development of Content Retrieval System of Scrapbook “Kunshujo” using IIIF and Deep Learning

Satoru Nakamura - The University of Tokyo (Japan)

Presentation type: Lightning talk

Abstract:

Background and Objectives

 We present a system for retrieving the contents of "Kunshujo", a collection of over 100 scrapbooks (https://iiif.dl.itc.u-tokyo.ac.jp/repo/s/tanaka/document/d230f1f8-5929-4138-bb28-1108d77fd32a). Pasted on their pages are various materials, such as pamphlets and product labels, dating from the late Edo period to the Taisho era (the Edo through Taisho periods span 1603 to 1926). They were collected by Yoshio Tanaka, a naturalist of the Meiji era. The library of The University of Tokyo released IIIF images of these scrapbooks, organized by book, in August 2018. However, users also wanted to search at the level of the individual materials pasted on the pages. To meet this need, we developed a system that can search over 35,000 pasted materials using IIIF annotations.

 For the annotation process, we used two methods: manual annotation with the Omeka IIIF Toolkit and automatic annotation with deep learning. The following sections describe each annotation method and the search system built on it.

Manual annotation with Omeka and IIIF Toolkit

 We set up an environment in which multiple users can annotate online using Omeka and the IIIF Toolkit plugin. Fifteen books were targeted, and six collaborators annotated 2,713 pasted materials on 826 pages. In addition, because another organization within The University of Tokyo has released catalog data (titles, dates, place names, etc.) for the annotated materials, we linked the images to that catalog data. Specifically, we converted the annotation data into data conforming to the IIIF Curation API defined by CODH (Center for Open Data in the Humanities) and recorded the catalog data as metadata of each curation entry.
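The conversion described above can be pictured with a minimal Python sketch. It assumes that each annotated region is stored as a canvas URI with an `#xywh=` fragment and that catalog fields become label/value pairs. The function names, the example URIs, and the catalog values are hypothetical, and the context URLs and property names should be checked against the Curation API specification published by CODH.

```python
import json


def make_member(canvas_uri, x, y, w, h, label, catalog):
    """Describe one pasted material as a region of a IIIF canvas.

    `catalog` is a plain dict of catalog fields (hypothetical example data).
    """
    return {
        "@id": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        "@type": "sc:Canvas",
        "label": label,
        # Catalog data is attached as label/value metadata pairs.
        "metadata": [{"label": k, "value": v} for k, v in catalog.items()],
    }


def make_curation(curation_id, label, manifest_uri, members):
    """Wrap members in a curation document (structure is an assumption)."""
    return {
        "@context": [
            "http://iiif.io/api/presentation/2/context.json",
            "http://codh.rois.ac.jp/iiif/curation/1/context.json",
        ],
        "@type": "cr:Curation",
        "@id": curation_id,
        "label": label,
        "selections": [
            {
                "@id": curation_id + "/range1",
                "@type": "sc:Range",
                "label": "Pasted materials",
                "members": members,
                "within": {"@id": manifest_uri, "@type": "sc:Manifest"},
            }
        ],
    }


if __name__ == "__main__":
    member = make_member(
        "https://example.org/iiif/book1/canvas/p12",  # hypothetical URI
        120, 340, 800, 600,
        "Product label",
        {"title": "Example title", "place": "Example place"},
    )
    curation = make_curation(
        "https://example.org/curation/book1.json",    # hypothetical URI
        "Kunshujo book 1",
        "https://example.org/iiif/book1/manifest.json",
        [member],
    )
    print(json.dumps(curation, indent=2, ensure_ascii=False))
```

Keeping the catalog data inside each member's `metadata` means a viewer or search index can read the image region and its description from a single JSON document.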

 Using these annotated data, we developed a system that searches pasted materials together with their catalog data, built on the IIIF Curation Platform developed by CODH (https://kunshujo.dl.itc.u-tokyo.ac.jp/table.html). The system also provides functions for browsing IIIF images on maps and chronologies. It is characterized by its use of various resources (IIIF, open data, and open source software) provided by multiple institutions.

Automatic annotation with Deep Learning

 We used the object detection algorithm YOLO v3 for automatic annotation. The manual annotations described above served as training data for fine-tuning the model. As a result, 34,746 pasted materials (annotations) on 13,069 pages from 99 books were extracted automatically. These annotations were also converted into data conforming to the IIIF Curation API.
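To illustrate how a detected region can become an IIIF annotation target, the following sketch converts one bounding box into an `#xywh=` fragment on a canvas URI. It assumes the detector reports boxes as normalized center coordinates plus width and height, as in the common YOLO label format. The function names and the example URI are hypothetical, and the actual output format depends on the YOLO v3 implementation used.

```python
def yolo_to_xywh(cx, cy, bw, bh, img_w, img_h):
    """Convert a normalized (center x, center y, width, height) box
    to integer pixel (x, y, w, h), clamped to the image bounds."""
    x0 = max(0, round((cx - bw / 2) * img_w))
    y0 = max(0, round((cy - bh / 2) * img_h))
    x1 = min(img_w, round((cx + bw / 2) * img_w))
    y1 = min(img_h, round((cy + bh / 2) * img_h))
    return x0, y0, x1 - x0, y1 - y0


def to_fragment_uri(canvas_uri, box):
    """Append the pixel box to a canvas URI as an xywh media fragment."""
    x, y, w, h = box
    return f"{canvas_uri}#xywh={x},{y},{w},{h}"


if __name__ == "__main__":
    box = yolo_to_xywh(0.5, 0.5, 0.5, 0.25, img_w=4000, img_h=2000)
    print(to_fragment_uri("https://example.org/iiif/book1/canvas/p12", box))
```

Clamping matters because a predicted box can extend slightly past the page edge, and a region outside the canvas would not be a valid target for an image request.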

 Using these extracted materials, we developed a similar-image retrieval system (http://kunshujo-i.dl.itc.u-tokyo.ac.jp/#/search?index=11854). We rely on image information because catalog data for these materials has not yet been created. We represent each image by a 2,048-dimensional feature vector extracted with the Inception-v3 model and measure image similarity by cosine similarity. The system makes it possible to find similar or identical pasted materials that appear in different books and pages.
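The similarity ranking can be sketched in a few lines of Python. The example uses short toy vectors in place of the 2,048-dimensional feature vectors, and the function names and identifiers are hypothetical. It is a brute-force illustration of cosine similarity, not a description of how the deployed system indexes its vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar(query_id, features, top_k=3):
    """Rank all other items by cosine similarity to the query item.

    `features` maps an item identifier to its feature vector.
    """
    query = features[query_id]
    scored = [
        (item_id, cosine_similarity(query, vec))
        for item_id, vec in features.items()
        if item_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real image features.
    features = {
        "a": [1.0, 0.0, 0.0],
        "b": [2.0, 0.0, 0.0],
        "c": [0.0, 1.0, 0.0],
        "d": [1.0, 1.0, 0.0],
    }
    for item_id, score in most_similar("a", features, top_k=2):
        print(item_id, round(score, 4))
```

Because cosine similarity ignores vector length, two images whose features differ only in overall magnitude (items "a" and "b" here) are treated as identical.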

Conclusion

 We described a case study based on hybrid annotation by humans and computers. We confirmed that IIIF helps reduce the cost of creating training data for machine learning and of developing applications.

Topics:

  • Annotation, including full-text or academic use cases
  • Using IIIF material for Machine Learning and AI

Keywords:

  • Annotation
  • Machine Learning
  • Curation