Voting closed 9 May 2013.
Title: WWI Linked Open Data Project
Team: WWI LOD
Short description: The main objectives of the WWI LOD project are to enhance access to and create context for online collections of WWI primary sources using a Linked Open Data approach.
Long description: Using the University of Colorado Boulder’s WWI Collection Online as a test bed, we are making the historical data buried within these documents – people, places, events and topics – easier to find, use and reuse, for instance in visualizing historical patterns. We are also supplying greater context for both the sources and the data contained within them by incorporating relevant datasets into the project and by linking to LOD sources outside the project, such as DBpedia. Further, we have built structures that are meant to be shared and reused, and thus to bind together disparate datasets relating to the war, such as an event-based framework of military, political and social events for WWI and a specialized vocabulary on occupied France and Belgium. Although our work is ongoing, we are pleased to have made enough progress to offer the following demo for presentation at the 2013 LODLAM summit.
Try it yourself: http://purl.org/ww1lod/lodlam
The demo uses modern technologies such as HTML5, CSS3 and SPARQL 1.1. Our project is entirely open: we have published a subset of the data and plan to publish all data and source code under open licenses. Going forward, we will further develop both the content and the user interface. See more about us, our project and data models, and access our datasets, at http://www.seco.tkk.fi/projects/history/.
The following is excerpted from an article under review and provides a more in-depth description about the project partners, goals and datasets:
This paper investigates how Linked Data might meet user needs using CU’s World War I (WWI) Collection Online1 as a test bed. In addition to representing the collection metadata as Linked Data, our aim is to deep link historical data in the sources related to the civilian experience in occupied Belgium and France to show the kinds of complex questions that can be answered and automated methods employed in a specialized domain. This approach can help meet user needs by linking related concepts in the sources using specialized vocabularies, enriching them with additional resources, and enabling semantically rich services that empower users.
WWI Linked Open Data (WWI LOD) Project
In order to test and evaluate how well Linked Data addresses user challenges with online primary sources, we initiated a multinational, interdisciplinary WWI Linked Open Data (WWI LOD) project involving computer scientists from Aalto University2 and librarians from CU (a subject specialist/domain expert, metadata specialist and digital initiatives librarian). The primary dataset used thus far is CU’s WWI Collection Online, which comprises over 1,100 titles (55,000 pages) published from 1829 to 1922, with the vast bulk of the material published between 1914 and 1918. The provenance of the collection is not entirely clear, but it likely entered the holdings of the CU University Libraries in the 1920s or 1930s by way of the Colorado in WWI Project, which History professor James Field Willard undertook to document citizen and state activities during the war3. The materials were bound into 58 large volumes, where they remained for many decades. The collection publications originate mainly from the U.S. and touch on a variety of geopolitical regions and topics, from ethnic and religious conflict to empire and colonies. A range of genres is represented including pamphlets, books, reports, speeches and maps.
This collection was chosen as the test bed for several reasons. First, it is the largest and most diverse collection of digitized primary-source content CU has available and includes both page images and keyword-searchable text. It also offers good general coverage of the entire war period as well as deeper coverage in areas of current scholarly interest, including empire and colonies, the civilian experience of war and occupation, and the question of US involvement in the war. Second, DBpedia, to which the project links to provide additional contextual information, is a rich source of information on people, places, events and topics associated with WWI. Third, WWI is not only of scholarly interest but also garners wide general interest. With the upcoming centenary of the war and a wide variety of organizations planning commemorative events, interest is running high among cultural heritage institutions, scholars and the general public, which we hope will create more opportunities for collaboration between organizations.
A. MARC Metadata and Transformations
In June 2011, the metadata specialist used the freely available MarcEdit 4 application to crosswalk the MARC records into MARCXML using an XSLT stylesheet. Once the records were in XML format, SeCo applied further transformations to convert them into Resource Description Framework (RDF) XML syntax for use as Linked Data. In September, the first conversion was completed, using a script which transformed the MARCXML output directly into RDF. Quality control was performed by SeCo and CU staff to determine the accuracy and effectiveness of the conversion. Some problems were evident in the converted metadata, as detailed below, as well as in the data linking. However, the conversion did provide the basics needed to start working with the metadata as Linked Data, for instance by creating Uniform Resource Identifiers (URIs) for each title in the collection. The team continued to look for ways to improve the quality of the conversion, monitoring the development of new data models for presenting MARC as Linked Data. In October the Library of Congress announced its intention to develop a Linked Data-based model, called the Bibliographic Framework (BIBFRAME)5, to replace the MARC format.
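As a rough illustration of the MARCXML-to-RDF step, the following Python sketch reads a MARCXML record, mints a URI for the title from its control number, and emits the title as an N-Triples statement. It is not the script SeCo used: the base URI, the choice of the Dublin Core `title` property, the sample record and the function names are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
DCT_TITLE = "http://purl.org/dc/terms/title"
# Hypothetical base URI, chosen only for this illustration.
BASE_URI = "http://example.org/ww1lod/title/"

# Invented sample record; not taken from the collection.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">ocm00000001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example pamphlet title /</subfield>
      <subfield code="c">by an example author.</subfield>
    </datafield>
  </record>
</collection>"""


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def marcxml_to_ntriples(xml_text):
    """Return one title triple per record that has fields 001 and 245$a."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.findall("marc:record", MARC_NS):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        title = record.find(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
        )
        if control is None or title is None:
            continue
        if not control.text or not title.text:
            continue
        # Mint the title URI from the record's control number.
        uri = BASE_URI + control.text.strip()
        # Drop trailing ISBD punctuation such as " /" or " :".
        clean = title.text.strip().rstrip(" /:")
        lines.append(f'<{uri}> <{DCT_TITLE}> "{escape_literal(clean)}" .')
    return lines


for line in marcxml_to_ntriples(SAMPLE_MARCXML):
    print(line)
```

A real conversion would map many more MARC fields and would need fuller literal escaping; the sketch only shows how a stable URI per title can fall out of the record identifier.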
The following May, LC announced a partnership with Zepheira to build a converter to transform MARC records into Linked Data6. In November 2012, the first version of such a tool was published. The converter serializes MARCXML into RDF using the BIBFRAME data model. SeCo subsequently used this tool to convert the MARCXML records into RDF again. In order to sync these two separate conversions, the team is mapping the URIs for each title from the original conversion to the new one.
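One way to picture the URI mapping between the two conversions is the Python sketch below. It assumes that both conversions can be keyed on a shared record identifier and that the mapping is expressed as `owl:sameAs` links; neither assumption is confirmed by the project description, and all URIs and identifiers shown are placeholders.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def map_uris(original, bibframe):
    """Link URIs from two conversions that share a record identifier.

    Both arguments map a record identifier to the URI minted for that
    title. Returns (N-Triples link lines, identifiers with no match).
    """
    links = []
    unmatched = []
    for record_id, old_uri in sorted(original.items()):
        new_uri = bibframe.get(record_id)
        if new_uri is None:
            # Leave these for manual review rather than guessing.
            unmatched.append(record_id)
        else:
            links.append(f"<{old_uri}> <{OWL_SAME_AS}> <{new_uri}> .")
    return links, unmatched


# Placeholder identifiers and URIs, for illustration only.
first_conversion = {
    "rec-1": "http://example.org/ww1lod/title/rec-1",
    "rec-2": "http://example.org/ww1lod/title/rec-2",
}
bibframe_conversion = {
    "rec-1": "http://example.org/bibframe/work/rec-1",
}

links, unmatched = map_uris(first_conversion, bibframe_conversion)
print(links)
print(unmatched)
```

Reporting unmatched identifiers separately keeps titles that exist in only one conversion visible instead of silently dropping them.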
B. Subject Vocabularies and Related Datasets
Given user needs, one of the project’s main objectives is to enhance subject access in the online collection and create context for the documents by establishing links between data points in the collection, vocabularies and other datasets incorporated into the project, and external data sources like DBpedia and Freebase7. Another is to facilitate the annotation of and deep linking of concepts among WWI collections in a specialized historical subdomain. The topic of the civilian experience in occupied Belgium and France in WWI was selected for this additional treatment not only because it was well-represented in the collection, but also because the impact of “total war” on civilian populations is an area of current scholarly interest. Most of the publications in the collection falling into this category deal with the hardships civilians suffered during the German invasion and occupation of Belgium and northern France, particularly atrocity incidents such as killings and worker deportations and the impact of military rule on day-to-day life. The deep linking in this subdomain was accomplished mainly through manual annotation using a specialized vocabulary developed for the project.
The first subject vocabularies incorporated into the project were the Library of Congress Subject Headings (LCSH) present in the existing MARC records. In addition to LCSH, further data sets meant to enrich subject access and context have been converted to RDF, among them internal vocabularies from the Imperial War Museum (IWM), WWI timeline data from the IWM’s First World War Centenary Partnership Programme, information on German atrocities in occupied Belgium and France from contributing scholars8, and the German army hierarchy culled from standard works9. The IWM internal vocabularies are, specifically, lists of approved event keywords relating to WWI and approved geographical keywords relating to the Western Front, based on the Getty TGN taxonomy and extended by terms related to IWM collections. The timeline data serves an important purpose, as it provides the backbone for a general, event-based framework for WWI that is meant to be shared widely, thus providing the “semantic glue” that binds separate datasets relating to WWI together and allows searching and browsing between them10. This framework and other structures we have created for this project will be made freely available for reuse.
In addition to the datasets that were converted, the project utilizes large data sources already available as LOD, most notably Geonames and DBpedia. Geonames currently contains over ten million geographical names such as geographical features and populated places, including alternate names in a variety of languages that enable multilingual searching. Most of the links between Geonames and instances already existing in the project were established through automatic means, but some required manual mapping. For instance, there were multiple geographic types associated with “Somme” – an administrative district, a river, etc. – so human intervention was required to determine which to associate with the event “Battle of the Somme, 1916”. Points additional to those in Geonames needed to be manually located for events like this one; although the battle might have been named after the river, it actually took place on a swath of land nearby. DBpedia data, on the other hand, touches on all people, places, events and topics treated in Wikipedia and thus provides context across the board for instances already in the project. A test set of links was generated and then was manually evaluated by a domain expert. This process is currently being refined so that it can be applied more broadly within the project.
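The split between automatic and manual linking described above can be sketched in Python as follows. This is only an illustration of the decision rule (link automatically when a name has exactly one candidate, otherwise defer to a person); the gazetteer entries, identifiers and function names are hypothetical, and the feature codes merely follow the Geonames style.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    place_id: str      # placeholder identifier, not a real Geonames id
    name: str
    feature_code: str  # Geonames-style code, e.g. ADM2 (district), STM (stream)


def link_place(label, gazetteer):
    """Return ("auto", place) for a unique name match.

    Any other outcome (no match, or several candidates) is returned
    as ("manual", candidates) so that a person can decide.
    """
    wanted = label.casefold()
    candidates = [p for p in gazetteer if p.name.casefold() == wanted]
    if len(candidates) == 1:
        return "auto", candidates[0]
    return "manual", candidates


# Hypothetical gazetteer extract, for illustration only.
GAZETTEER = [
    Place("example-1", "Somme", "ADM2"),
    Place("example-2", "Somme", "STM"),
    Place("example-3", "Ypres", "PPL"),
]

print(link_place("Ypres", GAZETTEER))
print(link_place("Somme", GAZETTEER))
```

In this sketch "Somme" is routed to manual review because both the administrative district and the river match the name, which mirrors the ambiguity described for the Battle of the Somme.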
Finally, the team is investigating the effect of deep linking within documents on occupied Belgium and France using a specialized vocabulary. Given the specific nature of this subdomain, the inadequacy of LCSH in terms of subject specificity, and the lack of existing ontologies, this vocabulary needed to be created for this purpose. The subject specialist/domain expert compiled a list of over 270 terms by adapting terminology and structures from the standard print bibliography on Belgium in WWI11, reading the literature on occupied Belgium and France in WWI and relevant documents from the collection, and incorporating feedback from historians in the field12. Links are established wherever possible between instances generated from this vocabulary and relevant instances from other datasets included in the project, e.g., the collection metadata and IWM vocabularies. Deep linking, combined with an intelligent user interface, is designed to demonstrate the types of complex questions that can be answered to meet user needs in this subdomain, such as: Is the scale of the atrocity incidents involving German troops in France and Belgium accurately reflected in the collection literature? What divisions of the German army were involved in the most incidents, and where were they garrisoned? What was the geographic distribution of deportations from the Belgian provinces? The team hopes this type of functionality will lead to a richer understanding of the many forces shaping the WWI period.
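To make the second of these questions concrete, the Python sketch below shows the kind of aggregation involved once incidents are linked to army units. The records are placeholders, not project data, and the flat tuple layout is an assumption; in the project itself such a question would presumably be posed against the linked dataset rather than an in-memory list.

```python
from collections import Counter

# Placeholder records (incident id, army unit, location); not project data.
INCIDENTS = [
    ("incident-1", "Unit A", "Place X"),
    ("incident-2", "Unit A", "Place Y"),
    ("incident-3", "Unit B", "Place X"),
]


def incidents_per_unit(incidents):
    """Count incidents per army unit, most frequently involved first."""
    counts = Counter(unit for _, unit, _ in incidents)
    return counts.most_common()


print(incidents_per_unit(INCIDENTS))
```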
A description of the WWI LOD RDF dataset is available online13. Currently a subset of the project data is available as a dump or via a SPARQL endpoint and is published under a Creative Commons CC-BY-SA license. Because it is not feasible in practice to associate different licenses with individual data elements, our project, like most others, must unfortunately make its data available under the most restrictive license attached to any subset of the incorporated data.
D. Infrastructure and Tools
Our work utilizes the existing FinnONTO ontology system14 and extends it with annotations on the datasets mentioned above using the SAHA semantic annotation tool15. Additionally, we are employing the ARPA tool16 to automate part of the annotation process. ARPA is an information extraction tool that automatically identifies, in textual documents, named entities and keyword concepts drawn from ontologies. The suggested annotations can then be validated and corrected manually using the SAHA editor. Further, we are using the SILK tool17 to discover links between data in our project and DBpedia. Finally, development of a customized WWI web portal based on the faceted HAKO portal engine18 is underway to facilitate searching and browsing the data by people, places, events, topics and time periods, and to represent it in visual and interactive ways.
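The general idea of suggesting annotations from ontology labels, which a person then validates, can be sketched in Python as below. This is not ARPA's algorithm or interface: it is a plain label lookup, and the label table, URIs and function name are hypothetical.

```python
import re

# Hypothetical label-to-concept table, for illustration only.
ONTOLOGY_LABELS = {
    "battle of the somme": "http://example.org/ww1lod/event/battle-of-the-somme",
    "belgium": "http://example.org/ww1lod/place/belgium",
    "deportation": "http://example.org/ww1lod/topic/deportation",
}


def suggest_annotations(text, labels):
    """Return (start offset, matched text, concept URI), sorted by offset.

    Matching is case-insensitive and on whole words only. The results
    are suggestions meant for manual validation, not final annotations.
    """
    suggestions = []
    for label, uri in labels.items():
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            suggestions.append((match.start(), match.group(0), uri))
    return sorted(suggestions)


sample = "Reports on deportation from Belgium after the Battle of the Somme."
for start, matched, uri in suggest_annotations(sample, ONTOLOGY_LABELS):
    print(start, matched, uri)
```

An exact-label lookup like this misses inflected forms and spelling variants, which is one reason a manual validation step remains useful.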
The team’s future plans for WWI LOD include continuing to create links between instances currently in the project, adding data from the Belgian statistical abstract from 1914-1918, adding entities through Latent Semantic Analysis and data mining techniques and, perhaps most importantly, creating the user interface. Once the interface is completed, its usability will be assessed in order to identify and resolve any issues. Further, once the project is completed, we intend to do a follow-up study to determine to what extent the Linked Data approach addresses the user needs expressed in the original assessment, and to weigh how practical implementing the Linked Data approach would be for other projects, based on the various approaches tested.
By linking related concepts across WWI datasets using specialized vocabularies and enabling semantically rich services, we hope to empower users to find and use online primary sources efficiently and effectively. The upcoming centenary of the War is already generating much interest, especially in the countries that were involved. Cultural heritage institutions and their partners can use this moment to engage users actively with the past and with the wealth of digital materials they have made available.
2 Semantic Computing Research Group, see http://www.seco.tkk.fi/.
3 These materials provided the basis for the University of Colorado Historical Collection and in turn the University Archives (David M. Hays, “The History of the Archives, University of Colorado at Boulder Libraries, 1917-2011” [unpublished paper, Archives, University of Colorado Boulder Libraries], 1-2).
5 For more on this collaborative effort, see http://bibframe.org/.
7 See http://www.freebase.com/.
8 Thanks to John Horne and Alan Kramer of Trinity College Dublin, who gathered and analyzed the atrocity data and granted permission to include it in the project.
9 Georg Tessin, Deutsche Verbände und Truppen 1918-1939 (Osnabrück: Biblio Verlag, 1974); and Hermann Cron, Imperial German Army 1914-18: Organisation, Structure, Orders of Battle, trans. C. F. Colton (Solihull: Helion, 2002).
10 Eero Hyvönen, Thea Lindquist, Juha Törnroos and Eetu Mäkelä, “History on the Semantic Web as Linked Data – An Event Gazetteer and Timeline for World War I”, in Proceedings of CIDOC 2012 – Enriching Cultural Heritage, Helsinki, Finland, June, 2012, available at http://www.cidoc2012.fi/en/File/1609/hyvonen.pdf.
11 Patrick Lefèvre and Jean Lorette, eds., La Belgique et la Première Guerre mondiale: Bibliographie, 2 vols. (Brussels: Musée Royal de l’Armée, 1987-2001).
12 Thanks are due to Martha Hanna (University of Colorado Boulder), Sophie de Schaepdrijver (Pennsylvania State University) and Tammy Proctor (Wittenberg University) for their suggestions.
13 See Juha Törnroos, Eetu Mäkelä, Thea Lindquist and Eero Hyvönen, “World War 1 as Linked Open Data”, currently under review for the Semantic Web Journal. The submission is accessible at http://purl.org/ww1lod/about/dataset_description.pdf.
15 For publications and source code download, see http://www.seco.tkk.fi/services/saha/.