Acquisition, integration and redistribution of structured data in GLAM: harmonizing practices

Call for projects (AAP) session:

Paris Région PhD 2023

Scientific lead:

  • Laurent Romary

Funding:

  • Région Ile-de-France
  • Inria

Summary:

GLAMs (Galleries, Libraries, Archives, Museums) have now fully integrated automatic text transcription into their data acquisition pipelines. However, most of the textual content already produced or planned is acquired and redistributed in unstructured form (raw text, without formatting or enrichment), in contrast to structured text, which can be queried like a database. Layout analysis and information structuring offer significant advantages: they make it possible to set up faceted search engines, carry out quantitative analyses, and enable data interoperability. The doctoral project examines the feasibility and implementation of restructuring content, either during data acquisition or when dealing with legacy data (that is, content already preserved in GLAM digital collections), using state-of-the-art machine learning technologies, in order to make corpora more accessible to users. It will also design scenarios for better integrating structured digital content into cultural heritage collections. Content modelling is a further crucial question, bound up with the relationship between the text, its structure, and the images. The project will base its first experiments on structuring a corpus of sales catalogues held at the French national library. These catalogues act as intermediaries to physical objects that are often inaccessible, and constitute genuine knowledge bases about them. Once this first corpus has been mastered, the technologies used will be applied to other corpora to study their robustness in the face of documentary diversity.