These are the session notes (sketchy I’m afraid) for the discussion on curation of linked open data on day 1 of the 2013 LODLAM summit in Montreal. There are multiple ways to look at curation and that can be seen in the different slants brought into the mix – curation of the data that the agency or person has (and its state or fitness for reuse and supply) and the data that it is desirable to link to (why and what does that mean). It is no surprise that questions of control and authority emerged and questions around reliance and co-contribution. What is the perfect combination and how long will those combinations of data complement each other?
The wording in (brackets) is mine from recall. Please feel free to comment and correct me if I’ve misinterpreted the notes.
These are the session notes (rough I’m afraid) for the discussion on making the case for linked open data on day 1 of the 2013 LODLAM summit in Montreal. At some point I’d really like to summarise these ideas better or maybe get to a point where it is possible to tell success stories and cautionary tales so that those interested in making or reusing LOD can pick up and expand on the precious work done thus far.
The wording in (brackets) is mine from recall. Please feel free to comment and correct me if I’ve misinterpreted the notes.
This was originally Normalizing Licensing and Data Models, but we decided that was too much to take on in one session. We had about 15 participants. I did my best to lead this session though was admittedly a bit exhausted! And now I’ve let too much time go by before getting my notes in here.
I started by describing some of the work we’re doing at Historypin to create metadata crowdsourcing and annotation tools for the public and in particular cultural heritage institutions. We talked briefly about our current efforts to consider the data models of Europeana and DPLA, as well as Open Annotation, and how we might incorporate some of this in as simple a way as possible, as we don’t want to differentiate between individuals and institutional contributors. I threw out this worksheet for comparing licensing across various platforms and would welcome anyone to add other examples to it (thanks Antoine Isaac for adding a bit to this already).
I think we agreed that we’ve come a long way from where we were two years ago at the last summit, when the 4-star scheme for open licensing of metadata was launched. Jerry Persons talked about Stanford policy and about the week-long workshop they held in July 2011, which recommended CC0 for all bibliographic metadata.
We talked a bit about international issues of copyright and licensing, with Chris of Digital New Zealand weighing in with the very good point that CC0 is not an option in New Zealand, or at least not respected by New Zealand law. Romain from the French National Library echoed this issue for France.
Romain also talked about what is copyrightable at all: courts in France have tested the difference between intellectual or creative content and plain fact, for which we agreed there is international precedent. I pointed out that we (at Historypin) are following the lead of the DPLA on this front.
From here we ventured a bit into creating and encouraging a culture of sharing in which institutions and individuals that share with open licensing could get some recognition, as well as some potential centralized site for tracking changes. We discussed the Cooper Hewitt release on GitHub, though it was pointed out that GitHub was putting a 15 MB limit on files. The OpenGLAM Data Hub could be a great shared source for us to list content. We talked about the importance and potential of combining forces across GLAMs internationally, and agreed that this would be a good place to share and, just as importantly, to show uses of and improvements to metadata.
We touched briefly on burnout among content providers who work very hard to release datasets and then find that no one uses them, or never hear about reuses of them, so encouraging this kind of community and circling back is critical.
I’m sure I missed a ton, please feel free to make additions/corrections/etc in the comments or in the notes doc directly.
Bonjour LODLAM community!
It has been a great pleasure to meet you all at LODLAM 2013 this spring/summer! Now, I am trying to recall this amazing event and trying to put the pieces together after two weeks.
The LODLAM 2013 Timetable (draft) is my attempt to rebuild an overall picture. As you can see from the table, the names of several facilitators are missing. If you know the facilitator of one of those sessions and would like to help me rebuild this landscape, please do not hesitate to drop me a note. Also, the 7 colors of the session blocks in the table reflect my own understanding of the issues; if you do not agree with my classification (especially from the point of view of the session facilitators), please let me know.
To witness this community doing something entirely different feels fantastic!
with great appreciation
andrea hunag from taipei taiwan
Mismuseos.net: Art After Technology (putting cultural data to work)
MISMUSEOS.NET MONTREAL CANADÁ.
LAM: Libraries, Archives and Museums are the places of our collective memory.
LOD: They contain a hidden graph, made up of nodes (entities) and lines (relations), with enormous possibilities for discovery and knowledge.
SEMANTIC WEB: Now we can compute those concealed relationships, and expose the connexions inside our collective memory graph under some conditions.
DESCRIPTION OF THE PROBLEM
The data of the museums are distributed and not connected. There are more than 55,000 museums in 202 countries. We cannot exploit this information in interesting ways using the capacity of machines while it is held in the current formats of knowledge representation.
So, the first part of the problem consists of building a Museums Micro Cloud of Linked, Clean and Curated Data with an underlying Specialized and Unified Graph.
Secondly, we wanted to connect cultural and educational worlds in a knowledge ecosystem. In other words, we wanted to valorise cultural information of our cultural heritage for educational purposes.
Our project shows a way to overcome the challenge of linking all the resources of all museums, by making that possibility real for a group of Spain’s greatest museums.
The project is a free-access online solution available at http://mismuseos.net.
MisMuseos.net gathers museum metadata from multiple Spanish public institutions. It is a semantic Museum of Museums.
It works according to the standards of the Semantic Web and the principles of the Linked Open Data Web.
We currently have a collection of seven great Spanish museums (a meta-museum), where users can browse over 17,000 pieces of art and 2,650 artists.
Mismuseos.net allows users to find and discover museum-related content, and also to reach related external information thanks to the correlation with other datasets.
The main goal of Mismuseos.net is to present a case of exploitation of Linked Data for the G.L.A.M. community through innovative end-user applications, such as facet-based searches and semantic context creation, which drastically improve user experience. It is built on GNOSS, a semantic and social software platform with a deep focus on the generation of social knowledge ecosystems and end-user applications in a Linked Data environment.
In more detail, the project is guided by the following goals:
TECHNOLOGY AND MAIN FEATURES:
The solution has been developed on gnoss.com
The featured applications in Mismuseos.net are:
1. Contexts for the entity ‘piece of art’
2. Contexts for the entity ‘artist’
HOW OUR INNOVATIVE IDEAS WILL ADVANCE THE GLAM COMMUNITY
Direct advances/ benefits:
Other potential advances:
The main problems we have faced in this project have been:
We think we have shown a possible way to travel and solve this set of problems
I wanted to share some of the work I’ve done at GTN-Québec on hierarchy alignment, its strengths and limits, and we shared ways to go beyond this.
First, our use case: we are harvesting bibliographic metadata from multiple sources that use different hierarchical thematic vocabularies (such as Dewey, Library of Congress, etc.). We want our users to find resources from all sources, using a classification term from any single vocabulary.
Most of our thematic hierarchies are small (<250 categories) and we have found it possible to align them by hand, using only the notion of related term (RT) and narrower term (NT) (from ISO 2788) across hierarchies. So it is possible to search for terms (and associated resources) along a path of RT/NT relations that may be either internal or cross-vocabulary.
Moreover, we can use a single (rich enough) vocabulary as a “pivot”, so we would establish correspondence between each vocabulary and the pivot, rather than between each pair of vocabularies. (We have chosen to use Dewey, which was the largest vocabulary we had to use, as pivot.)
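A minimal sketch of the pivot idea, assuming each hand-made alignment is stored as a directed edge labelled NT or RT. The term identifiers, the `vocabA`/`vocabB` names and the `reachable` helper are hypothetical and chosen only for illustration; this is not the GTN-Québec implementation.

```python
from collections import deque

# Hand-made alignments: (source term, relation, target term).
# "NT" reads as: target is a narrower term of source; "RT" is symmetric.
# All term identifiers here are invented for illustration.
ALIGNMENTS = [
    ("dewey:500", "NT", "dewey:510"),             # internal to the pivot
    ("dewey:510", "RT", "vocabA:maths"),          # vocabulary A aligned to the pivot
    ("dewey:510", "RT", "vocabB:mathematiques"),  # vocabulary B aligned to the pivot
    ("vocabA:maths", "NT", "vocabA:algebra"),     # internal to vocabulary A
]


def build_graph(alignments):
    """Turn the alignment list into an adjacency map."""
    graph = {}
    for source, relation, target in alignments:
        graph.setdefault(source, set()).add(target)
        if relation == "RT":  # related-term links can be followed both ways
            graph.setdefault(target, set()).add(source)
    return graph


def reachable(graph, start):
    """Breadth-first walk along NT/RT links, returning every term reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for neighbour in graph.get(term, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(ALIGNMENTS)
terms_for_maths = reachable(graph, "vocabB:mathematiques")
```

Here a search that starts from a term in vocabulary B reaches the terms of vocabulary A through the pivot, although A and B were never aligned with each other directly. Broader pivot terms are not reached, because NT links are only followed downwards in this sketch.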
There are interesting corner cases, but more often than not they can be resolved at a finer level. For example, one vocabulary lumps linguistics and literature for each “foreign” language under that language’s studies. This does not correspond to any of Dewey’s categories, but specific linguistic or literary subcategories can be related as narrower terms to the appropriate Dewey categories. In some more complex cases, some categories find themselves between two Dewey categories, or vice-versa.
In general, our choice has been to let some categories dangle if they can be resolved at the subcategory level, which means that resources at that abstract level cannot be found; thus we have favoured precision over recall. But sometimes, there are no sub-categories to use, and we are reaching the limits of this approach.
In the discussion, we gave many examples of “almost identical” categories, especially across languages; someone mentioned how the German “Gemüse” category (roughly our vegetables) excluded potatoes. We raised the prospect of codifying exceptions (NOT narrower term), but we agreed using negation was probably more demanding than it was worth.
First, we agreed that for search, lack of precision was a comparatively minor flaw, and that we should distinguish matches by a degree of quality. Path traversals with a cost at each step are a known solution to this problem; but we need also to convey to the users that some results are more or less certainly relevant.
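As a rough illustration of cost-based traversal, the sketch below uses Dijkstra's algorithm over NT/RT links and ranks matches by total path cost, so that a lower cost stands for a more certain match. The cost values and term identifiers are assumptions made up for this example, not figures from the session.

```python
import heapq

# Cost of following each kind of link; the values are arbitrary assumptions.
STEP_COST = {"NT": 1.0, "RT": 2.0}

# (source term, relation, target term); identifiers are invented.
LINKS = [
    ("vocabA:vegetables", "RT", "vocabB:gemuese"),
    ("vocabA:vegetables", "NT", "vocabA:potatoes"),
    ("vocabB:gemuese", "NT", "vocabB:karotten"),
]


def cheapest_paths(links, start):
    """Dijkstra's algorithm: lowest total cost from start to each reachable term."""
    graph = {}
    for source, relation, target in links:
        graph.setdefault(source, []).append((target, STEP_COST[relation]))
        if relation == "RT":  # related-term links can be followed both ways
            graph.setdefault(target, []).append((source, STEP_COST[relation]))

    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, term = heapq.heappop(heap)
        if cost > best.get(term, float("inf")):
            continue  # stale queue entry
        for neighbour, step in graph.get(term, []):
            new_cost = cost + step
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return best


def ranked_matches(links, start):
    """Terms ordered from most to least certain (lowest cost first)."""
    costs = cheapest_paths(links, start)
    return sorted(costs.items(), key=lambda item: (item[1], item[0]))


matches = ranked_matches(LINKS, "vocabB:gemuese")
```

The cost attached to each match could then be shown to users, for instance as a relevance indicator, so that distant matches are presented as less certain rather than silently mixed with exact ones.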
Chris McDowall mentioned that whether a resource belongs to a category is contextual, and explained how supervaluationism distinguishes context-free “super-truths” from contextual truths. Concretely, this could translate into qualifying element membership by the number of people asserting it. (This might be related to Yves Raymond’s session on user feedback for categories.)
One big point of agreement is that category equivalence should be determined through Big Data in general, and not by hand. Michel Gagnon proposed using document classification as a way to infer category equivalence; in a private conversation, I mentioned the issue of corpora in different languages, and he proposed using DBpedia terms as a pivot.
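One possible reading of the document-classification proposal, sketched under assumptions: if the same documents have been classified under two vocabularies, categories that share most of their documents are candidates for equivalence. The Jaccard overlap measure, the 0.5 threshold and the sample data are my own choices for illustration; none of them was specified in the session.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_equivalences(vocab_a, vocab_b, threshold=0.5):
    """Pairs of categories whose document sets overlap at or above the threshold.

    vocab_a and vocab_b map a category name to the set of document ids
    classified under it.
    """
    pairs = []
    for cat_a, docs_a in vocab_a.items():
        for cat_b, docs_b in vocab_b.items():
            score = jaccard(docs_a, docs_b)
            if score >= threshold:
                pairs.append((cat_a, cat_b, score))
    return sorted(pairs, key=lambda pair: -pair[2])  # best candidates first


# Invented sample data: document ids classified under each category.
vocab_a = {"maths": {"d1", "d2", "d3"}, "history": {"d4", "d5"}}
vocab_b = {"mathematiques": {"d1", "d2", "d3", "d6"}, "histoire": {"d4", "d5"}}
pairs = candidate_equivalences(vocab_a, vocab_b)
```

The score is a degree of overlap rather than a yes/no answer, so it could also feed the idea of distinguishing matches by quality.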
We also mentioned the possibility of inferring the taxonomy itself rather than only equivalence between existing taxonomies; could there be intermediate cases between the rigidity of fixed vocabularies and the anarchy of folksonomy?
We also mentioned tools for multifaceted classification, such as feature lattices or componential analysis, and methodologies for ontology alignment (such as this one for another domain, using category theory).
Only four participants! Antoine Isaac (Europeana), Romain Wenz (BnF), Ryan Donahue (Met), Cate O’Neill (Find&Connect)
As it appears, there are more urgent issues to solve for LODLAM.
In fact, the issues are similar to the ones that were raised about the WWW long ago. As the WWW survived them, maybe LD can survive them too. It seems tricky for ‘reference’ datasets, however. And what would happen when you re-use others’ data?
Some (only slightly curated) bullet points:
- Basic issue: allowing decentralized data access and use, preservation beyond basic requirement of persistent URIs. Data/links can change!
- Handling updates: similar to what happens for historical place names in catalogues (the scope of “The Netherlands” as of 1821, as opposed to later).
- Preserving context: keeping different levels of truth, different parts of the provenance (time and data producers)
- RDF triples make time and data provenance tricky to represent, unless we go for quads or versioned URIs (which have their disadvantages). BnF more or less tracks the provenance manually (on demand).
- Serve representations (data) for which “versions” of a resource (URI)? Interest of an “historical GET”, comparable to Memento (www.mementoweb.org).
Basic solution: no versioned URIs for the resource, but keep track of the different versions of the representations (RDF data, HTML page). data.bnf.fr uses the Internet Archive to archive its representations (just one canonical representation, RDF/XML, for each URI).
Create a dataset of datasets so that their archives can be found again?
- How to decide what to preserve and give access to? Everything, every version? What linked data users probably want to get is the “best” data for the identifier. And it may change! E.g., deprecating some names in authorities from preferred to alternative.
BnF has some cases where people ask to remove data (birth dates, attributions that are not good for their reputation). In such cases, it’s not really desirable to even keep track of historical data in the authoritative service.
Should we mint/re-use URIs or HTTP code for saying that data was removed?
Note: cf OAIS: preservation success is success *for humans*!
- Examples of linked data that was not preserved?
Probably some Talis datasets.
- Misc. remarks on persistent identifiers.
A trick to preserve identifiers is to embed identifiers inside other identifiers. But this needs some resolver service!
URI design: the problem of meaning attached to the URI. We need to separate the description function from the identification one.
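The “no versioned URIs, but versioned representations” approach and the “historical GET” idea from the notes above can be sketched together as a toy in-memory archive. The class and method names and the example URI are hypothetical. Answering with HTTP 410 (Gone) for withdrawn data is only one possible response to the question raised in the session, not something the group decided.

```python
from bisect import bisect_right
from datetime import datetime

GONE = object()  # tombstone marking that the data was withdrawn


class RepresentationArchive:
    """Keeps dated representations per URI; the URI itself never changes."""

    def __init__(self):
        self._versions = {}  # uri -> list of (timestamp, body), kept sorted

    def record(self, uri, when, body):
        versions = self._versions.setdefault(uri, [])
        versions.append((when, body))
        versions.sort(key=lambda version: version[0])

    def withdraw(self, uri, when):
        self.record(uri, when, GONE)

    def get(self, uri, as_of=None):
        """Return (status, body) for the representation valid at as_of.

        With as_of=None the latest representation is served, on the
        assumption that this is what most linked data consumers want.
        """
        versions = self._versions.get(uri)
        if not versions:
            return 404, None
        if as_of is None:
            body = versions[-1][1]
        else:
            timestamps = [when for when, _ in versions]
            index = bisect_right(timestamps, as_of)
            if index == 0:
                return 404, None  # nothing existed yet at that date
            body = versions[index - 1][1]
        if body is GONE:
            return 410, None
        return 200, body


archive = RepresentationArchive()
uri = "http://example.org/authority/123"  # hypothetical identifier
archive.record(uri, datetime(2012, 1, 1), "<rdf>preferred name A</rdf>")
archive.record(uri, datetime(2013, 1, 1), "<rdf>preferred name B</rdf>")
```

Note that this toy still serves older versions after a withdrawal. For the removal requests described above, an authoritative service may instead want to purge the history entirely.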
The teaching LODLAM session discussed curriculum development, approaches to pedagogy, and what we need in order to teach linked data in the classroom. We are starting a Google group to continue this conversation.
The World War 1 session ran a little over time and spilled out over lunch outside with a lot of talk about the war, literature and linking across data sets. I’ve copied here the people who listed their information on the sheet.
|EU||Australia War Literature||Europeana WWI||“Isaac, A.H.J.C.A.”|
|UK||Trenches to Triples||King’s College||Geoffrey Browell|
|Australia||Australia War Literature||http://www.austlit.edu.au/|
|Canada||Out of the Trenches / Au-delà des tranchées||Pan Canadian Documentary Heritage||Pat Riva||http://www.ghamari.net:8080/canada/|
|New Zealand||Remembering WW1||?||?||ww100.govt.nz|
|UK||Open Metadata gateway||King’s College London Archives||Geoff Browell|
|Finland / US||WW1LOD Project||Semantic Computing Research Group / Aalto||Thea Lindquist, Hyvönen Eero et al.||http://purl.org/ww1lod|
|France||Awesome rdf-enabled online library||French National Library||Romain Wenz||data.bnf.fr|
|Canada||Muninn WW1 Project||-||Rob Warren||rdf.muninn-project.org/sparql|
Where do we go from here?
We suggest that you look at the LODLAM group and sign up to the ww1-lod mailing lists. We have had some very good talk about integrating GIS information and integrating data over multiple SPARQL servers.
Keep in touch and keep doing great work!