Honestly, metadata is really, really boring. But hey, metadata is extremely useful. Without good metadata, we can’t really solve the findability problems we have on our intranet. To give the right person the right information, at the right time, in the right place and in the right way, we must use metadata extensively. Or, to be more precise, we must use master metadata. In this post I’ll try to explain how we mandate the use of metadata without making it a barrier to publishing information.
Important note: I’m foremost a practitioner and I prefer systems that solve problems in a pragmatic way. I’m sure there are more elegant and correct ways of solving the problem with metadata if you ask information architects/managers and taxonomists.
There are some problems…
If metadata is mandatory but hard to add, publishers will simply pick the first available value just to get past the mandatory field. That is actually even worse than no metadata at all. Usually we have relied on web editors to add keywords manually, but the problem is that the web editor needs (very) good domain knowledge to add the right metadata (keywords, subject). There should be a system that helps content editors add relevant metadata. Also, more people than just web editors should be able to add metadata.
But to any problem there is a solution…
The most important thing is that it should be very easy to add metadata to any information. It needs to be designed so that it only takes a few seconds, so that web editors with good domain knowledge can add metadata fast. The web editor should also be able to get help from the system, by picking metadata from a list of suggested keywords. Users should be able to help add metadata to information as well, by tagging it.
We have decided to have three separate types of keyword metadata:
- Keywords that belong to a taxonomy like MeSH or SNOMED CT
- Keywords that are manually added by the content editor (the old standard way)
- Keywords that are added by users, i.e. tagging
Because the three types of keywords are kept separate, the search engine’s relevancy model can use them in different ways, depending on the content type, content usage etc.
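To make the idea a bit more concrete, here is a minimal sketch of how a relevancy model could weight the three keyword types differently. The field names (`taxonomy`, `editor`, `user_tags`) and the weights are assumptions made up for this example; they are not taken from our search engine configuration.

```python
# Hypothetical weights: taxonomy terms count most, user tags least.
WEIGHTS = {"taxonomy": 3.0, "editor": 2.0, "user_tags": 1.0}


def keyword_score(query_terms, doc, weights=WEIGHTS):
    """Sum the weight of every keyword field in which a query term appears."""
    score = 0.0
    for field, weight in weights.items():
        # Compare case-insensitively; a missing field simply contributes nothing.
        field_terms = {keyword.lower() for keyword in doc.get(field, [])}
        hits = sum(1 for term in query_terms if term.lower() in field_terms)
        score += weight * hits
    return score


doc = {
    "taxonomy": ["Heat Stroke"],
    "editor": ["summer"],
    "user_tags": ["heat stroke", "beach"],
}
print(keyword_score(["heat stroke"], doc))  # matches taxonomy and user_tags
print(keyword_score(["beach"], doc))        # matches user_tags only
```

A different content type could simply pass in a different `weights` dictionary, which is the kind of flexibility the separation is meant to give.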
We have tried to design a system that is as non-intrusive as possible in order to get all content editors/contributors, users etc. to add metadata. We have developed (and open sourced) a few metadata services (documentation now in English!) that we think are useful:
- Content analysis
- Keyword service
- Controlled lists
- Tagging service
Content analysis and the keyword service
With the press of a button, the content is sent to the metadata service, where the content analysis strips away all formatting so that the keyword service can analyze the text and identify good keyword candidates. This is done by comparing the content to a larger corpus/model: words that occur in the document but are rare in the larger corpus are, according to statistical models, very likely to be good (unique) keywords for the document being analyzed. The resulting keywords can then be mapped against a taxonomy, e.g. MeSH, and if a corresponding term is found for a keyword, we can add that term as metadata as well. This is very useful, as it opens up the possibility of semantic data (or linked data, if you prefer). The keywords are then returned to the system that asked for the document-specific keywords. Example below with keyword suggestions for a short text about sunstroke and heatstroke from the implementation in our CMS.
What usually happens is that the keywords are presented to the content editor, who can then choose the relevant keywords from the suggested list. This further improves the quality of the keywords.
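For readers who want a feel for the statistics, here is a minimal sketch of the "rare in the larger corpus" idea. It is not the code of our metadata service: the scoring formula (document frequency times the log of the ratio to a smoothed corpus frequency), the function names, the toy corpus counts and the taxonomy dictionary are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def strip_markup(html):
    """Rough stand-in for content analysis: drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def suggest_keywords(document, corpus_counts, corpus_total, top_n=5):
    """Rank words that are frequent in the document but rare in the corpus."""
    tokens = tokenize(strip_markup(document))
    doc_counts = Counter(tokens)
    scores = {}
    for word, count in doc_counts.items():
        doc_freq = count / len(tokens)
        # Add-one smoothing so unseen words do not cause a division by zero.
        corpus_freq = (corpus_counts.get(word, 0) + 1) / (corpus_total + 1)
        scores[word] = doc_freq * math.log(doc_freq / corpus_freq)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def map_to_taxonomy(keywords, taxonomy):
    """Keep only the keywords that have a corresponding taxonomy term."""
    return {kw: taxonomy[kw] for kw in keywords if kw in taxonomy}


# Toy background corpus: word counts invented for this example.
corpus_counts = Counter({"the": 1000, "is": 600, "a": 700, "of": 500,
                         "and": 650, "can": 100, "be": 200, "by": 150,
                         "caused": 10, "sun": 5})
corpus_total = 5000

html = "<p>Heatstroke is caused by the sun. Heatstroke and sunstroke can be dangerous.</p>"
suggestions = suggest_keywords(html, corpus_counts, corpus_total, top_n=3)
print(suggestions)

# Hypothetical taxonomy lookup; a real one would live in the terminology server.
print(map_to_taxonomy(suggestions, {"heatstroke": "Heat Stroke"}))
```

Common words like "the" and "is" end up with low or negative scores because they are at least as frequent in the corpus as in the document, so they drop out of the suggestion list.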
Controlled lists, one type of master-metadata
Another way we use the metadata service is to provide us with controlled lists of metadata, e.g. target groups, cities, subjects or document types. We have a lot of these lists, and they are basically all governed by our information management people. The lists and taxonomies are all stored in our terminology/taxonomy server (Apelon DTS), an open source product. This gives us the opportunity to use the same master metadata in many different information systems. In practice, this means that the content editor can choose metadata from e.g. a drop-down list. Example below where the metadata element “Use for” is highlighted:
This is necessary if we want to give the right information to the right people. This way we can use our search engine to find all documents related to a specific document type, for a specific subject, at a given geo-location that relates to a specified target group. For example we could ask the search engine to
“Give me all documents regarding personnel benefits that apply to everybody working at the HQ in Gothenburg as IT strategists”.
Of course the actual (programmatic) search query is formatted in a different way, but still it’s the same query.
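As an illustration only, here is roughly what such a query boils down to once every document carries values from the controlled lists. The field names, the sample documents and the `find_documents` helper are invented for this sketch; a real search engine would express the same thing in its own filter syntax.

```python
# Sample documents; all field names and values are made up for the example.
documents = [
    {"title": "Benefits for HQ staff",
     "subject": ["personnel benefits"],
     "city": ["Gothenburg"],
     "target_group": ["IT strategists", "HR staff"]},
    {"title": "Benefits for clinic staff",
     "subject": ["personnel benefits"],
     "city": ["Uddevalla"],
     "target_group": ["nurses"]},
    {"title": "IT strategy overview",
     "subject": ["IT strategy"],
     "city": ["Gothenburg"],
     "target_group": ["IT strategists"]},
]


def find_documents(docs, **filters):
    """Return titles of documents where every filter value is in that field."""
    return [
        doc["title"]
        for doc in docs
        if all(value in doc.get(field, ()) for field, value in filters.items())
    ]


hits = find_documents(
    documents,
    subject="personnel benefits",
    city="Gothenburg",
    target_group="IT strategists",
)
print(hits)
```

This only works reliably because the values come from controlled lists: if one editor wrote "Göteborg" and another "Gothenburg", the exact-match filter would silently miss documents.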
Users can also add valuable and usable metadata if they are allowed to tag the information. We did not want to allow any keyword to be added by an anonymous user (we wanted to avoid swearwords, dirty words etc.). Instead, we wanted to automate which tags are added and which are not by using a tagging service, described in the illustration below:
This is how we work with adding metadata to our information. All of the metadata-services described here are open sourced by us or others.
Any comments are very welcome.
The Information flow part 2: Information and metadata by sys 64738, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.