Notes on hierarchy alignment
I wanted to share some of the work I’ve done at GTN-Québec on hierarchy alignment, along with its strengths and limits; in the session we then discussed ways to go beyond it.
First, our use case: we harvest bibliographic metadata from multiple sources that use different hierarchical thematic vocabularies (such as Dewey, Library of Congress, etc.). We want our users to find resources from all sources, using a classification term from any single vocabulary.
Most of our thematic hierarchies are small (<250 categories) and we have found it possible to align them by hand, using only the notion of related term (RT) and narrower term (NT) (from ISO 2788) across hierarchies. So it is possible to search for terms (and associated resources) along a path of RT/NT relations that may be either internal or cross-vocabulary.
Moreover, we can use a single (rich enough) vocabulary as a “pivot”, so that we establish a correspondence between each vocabulary and the pivot, rather than between each pair of vocabularies. (We have chosen Dewey, the largest vocabulary we had to use, as the pivot.)
There are interesting corner cases, but more often than not they can be resolved at a finer level. For example, one vocabulary lumps linguistics and literature for each “foreign” language under that language’s studies. This does not correspond to any of Dewey’s categories, but specific linguistic or literary subcategories can be related as narrower terms to the appropriate Dewey categories. In some more complex cases, some categories find themselves between two Dewey categories, or vice-versa.
In general, our choice has been to let some categories dangle if they can be resolved at the subcategory level, which means that resources classified only at that abstract level cannot be found; thus we have favoured precision over recall. But sometimes there are no subcategories to use, and there we reach the limits of this approach.
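To make the pivot approach and the dangling case concrete, here is a minimal sketch in Python. It is not our production code: the term names, the `EDGES` list, and the `build_graph` and `expand` helpers are all hypothetical, and the Dewey labels are simplified for illustration. NT links are followed downward only, while RT links are treated as symmetric.

```python
from collections import deque

# Hypothetical toy data: terms are "vocabulary:label" strings.
# Each edge is (source, relation, target); for NT, target is narrower than source.
EDGES = [
    # Internal hierarchy of the pivot (labels simplified, not authoritative).
    ("dewey:language", "NT", "dewey:linguistics"),
    ("dewey:literature", "NT", "dewey:french-literature"),
    # Internal hierarchy of another vocabulary that lumps both under "studies".
    ("vocabA:french-studies", "NT", "vocabA:french-linguistics"),
    ("vocabA:french-studies", "NT", "vocabA:french-literature"),
    # Hand-made cross-vocabulary alignments, always towards the pivot.
    ("dewey:linguistics", "NT", "vocabA:french-linguistics"),
    ("dewey:french-literature", "NT", "vocabA:french-literature"),
    ("dewey:french-literature", "RT", "vocabB:french-lit"),
]


def build_graph(edges):
    """Build an adjacency map: NT is one-way (broader to narrower), RT is two-way."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, set()).add(target)
        if relation == "RT":
            graph.setdefault(target, set()).add(source)
    return graph


def expand(graph, term):
    """Return every term reachable from `term` along RT/NT links (breadth-first)."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(EDGES)
literature_terms = expand(graph, "dewey:literature")
language_terms = expand(graph, "dewey:language")
```

In this toy data, a search on `dewey:literature` reaches `vocabA:french-literature` and `vocabB:french-lit`, but no Dewey term ever reaches `vocabA:french-studies`: that category dangles, so resources attached only to it are missed.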
In the discussion, we gave many examples of “almost identical” categories, especially across languages; someone mentioned how the German “Gemüse” category (roughly our vegetables) excluded potatoes. We raised the prospect of codifying exceptions (NOT narrower term), but we agreed using negation was probably more demanding than it was worth.
First, we agreed that for search, lack of precision was a comparatively minor flaw, and that we should distinguish matches by a degree of quality. Path traversals with a cost at each step are a known solution to this problem; but we also need to convey to users that some results are more certainly relevant than others.
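As a rough illustration of cost-based traversal, the sketch below runs Dijkstra's algorithm over a small RT/NT graph and turns the accumulated cost into a label that could be shown to users. The step costs, the thresholds, the toy graph, and the names `ranked_matches` and `confidence_label` are assumptions made for the example, not values we settled on.

```python
import heapq

# Assumed step costs: following a narrower term is cheaper than a related term.
STEP_COST = {"NT": 0.5, "RT": 1.0}

# Hypothetical toy graph: term -> list of (relation, neighbour).
GRAPH = {
    "pivot:vegetables": [("NT", "pivot:root-vegetables"), ("RT", "other:vegetables")],
    "pivot:root-vegetables": [("NT", "pivot:potatoes")],
    "other:vegetables": [("NT", "other:carrots")],
}


def ranked_matches(graph, start, max_cost=2.0):
    """Return (term, cost) pairs reachable within max_cost, cheapest first."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, term = heapq.heappop(heap)
        if cost > best.get(term, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for relation, neighbour in graph.get(term, []):
            new_cost = cost + STEP_COST[relation]
            if new_cost <= max_cost and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


def confidence_label(cost):
    """Map a path cost to a coarse label for display (thresholds are arbitrary)."""
    if cost == 0:
        return "exact"
    if cost <= 1.0:
        return "likely"
    return "possible"


matches = ranked_matches(GRAPH, "pivot:vegetables")
labelled = [(term, confidence_label(cost)) for term, cost in matches]
```

Lowering `max_cost` trades recall for precision; the labels are one simple way to tell users that distant matches are less certain.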
Chris McDowall mentioned that whether a resource belongs to a category is contextual, and explained how supervaluationism distinguishes context-free “super-truths” from contextual truths. Concretely, this could translate into category membership being qualified by the number of people asserting it. (This might be related to Yves Raymond’s session on user feedback for categories.)
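One loose way to read this in code, assuming each user's classification counts as one "context": count how many users assert each membership, and call it a super-truth only when every user asserts it. The data and the names `membership_support` and `super_truths` are hypothetical, and this is only a sketch of the idea, not a faithful formalisation of supervaluationism.

```python
from collections import defaultdict

# Hypothetical assertions: (user, resource, category).
ASSERTIONS = [
    ("alice", "doc1", "vegetables"),
    ("bob", "doc1", "vegetables"),
    ("carol", "doc1", "vegetables"),
    ("alice", "doc2", "vegetables"),  # e.g. a resource about potatoes
    ("alice", "doc2", "vegetables"),  # duplicate assertion, counted once
]


def membership_support(assertions):
    """Return {(resource, category): number of distinct users asserting it}."""
    supporters = defaultdict(set)
    for user, resource, category in assertions:
        supporters[(resource, category)].add(user)
    return {key: len(users) for key, users in supporters.items()}


def super_truths(assertions):
    """Memberships asserted by every user seen in the data (assumed reading)."""
    all_users = {user for user, _, _ in assertions}
    support = membership_support(assertions)
    return {key for key, count in support.items() if count == len(all_users)}


support = membership_support(ASSERTIONS)
agreed = super_truths(ASSERTIONS)
```

Here `doc1` is a vegetable for all three users, while `doc2` is only contextually one.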
One big point of agreement is that category equivalence should be inferred from data at scale (Big Data) rather than determined by hand. Michel Gagnon proposed using document classification as a way to infer category equivalence; in a private conversation, I mentioned the issue of corpora in different languages, and he proposed using DBpedia terms as a pivot.
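One possible reading of inference from document classification, sketched below, is to compare the sets of documents filed under each category in two vocabularies and propose the pairs that overlap strongly. This assumes the same documents have been classified (by hand or by a classifier) in both vocabularies; the Jaccard measure, the threshold, and all names and data are my own illustrative choices, not what Michel described in detail.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_alignments(docs_a, docs_b, threshold=0.5):
    """Propose (category_a, category_b, score) pairs whose document sets overlap."""
    pairs = []
    for cat_a, set_a in docs_a.items():
        for cat_b, set_b in docs_b.items():
            score = jaccard(set_a, set_b)
            if score >= threshold:
                pairs.append((cat_a, cat_b, score))
    # Strongest candidates first, for human review.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical data: category -> set of document identifiers.
DOCS_A = {"A:cooking": {1, 2, 3, 4}, "A:gardening": {5, 6}}
DOCS_B = {"B:cuisine": {1, 2, 3}, "B:jardinage": {5, 6, 7}}

candidates = candidate_alignments(DOCS_A, DOCS_B)
```

The output is best treated as a list of suggestions to confirm, not as final alignments; asymmetric overlaps could similarly hint at NT rather than equivalence.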
We also mentioned the possibility of inferring the taxonomy itself rather than only equivalence between existing taxonomies; could there be intermediate cases between the rigidity of fixed vocabularies and the anarchy of folksonomy?
We also mentioned tools for multifaceted classification, such as feature lattices or componential analysis, and methodologies for ontology alignment (such as this one for another domain, using category theory).