Exporting annotations as Web Annotation Data Model
Is your feature request related to a problem? Please describe.
I am evaluating INCEpTION as a tool for extracting data from scanned documents into an RDF Knowledge Graph (re #4535). To produce the final data, some degree of export manipulation is necessary.
Describe the solution you'd like
I think that the closest matching standard is Web Annotation Data Model. It is defined as JSON-LD, and thus RDF. That makes it a good candidate for a first step where it would be easy to transform it to the desired RDF model. Because RDF is platform agnostic, such exports could even be fed directly to a triple store instead of processing in code.
The web annotation model could be easily extended through the JSON-LD context. For the basic properties used by the web annotation model, the mapping would be simple if layer properties matched their names.
Describe alternatives you've considered
I had high hopes for NIF but the export does not include. Since both are RDF formats, the web annotations could be added as part of the NIF output.
Other options involve converting one of the supported export formats.
Additional context
Given annotations in RDF, it should be fairly simple to loop the newly created knowledge directly back into the knowledge base of INCEpTION, similarly to what @reckart shows in #4535. Here are the annotations from my detailed example, written as JSON-LD.
Some notes:
- I added nif:Context as found in the NIF export (plus the nif prefix in the context)
- I added a third context to map layer properties to the correct RDF. This could require a new admin setting where users could fine-tune the mapping. Here I only needed to add the schema namespace, and assumed that the two properties of the Spec layer were defined as KB entities/instances/properties.
- The additional types like urn:inception:annotatation#Entity seem unnecessary for this PoC but might be useful overall
- I'm not sure if using other annotations as target/body in the SpecLink is kosher, but it seems like a simple enough pattern. If anything, RDF makes such extensions easy
{
"@context": [
"https://www.w3.org/ns/anno.jsonld",
{
"nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
},
{
"schema": "http://schema.org/"
}
],
"@graph": [{
"id": "http://example.org/entity1",
"type": [ "Annotation", "urn:inception:annotatation#Entity" ],
"body": {
"id": "https://new.wikibus.org/vehicle/autosan/h10"
},
"target": {
"source": "_:contents",
"selector": {
"type": "TextPositionSelector",
"start": 340,
"end": 353
}
}
}, {
"id": "http://example.org/spec1",
"type": [ "Annotation", "urn:inception:annotatation#Spec" ],
"body": {
"schema:unitCode": { "@id": "http://qudt.org/schema/qudt/MilliM" },
"schema:propertyID": { "@id": "http://schema.org/length" }
},
"target": {
"source": "_:contents",
"selector": {
"type": "TextPositionSelector",
"start": 395,
"end": 401
}
}
}, {
"id": "http://example.org/spec-link1",
"type": [ "Annotation", "urn:inception:annotatation#SpecLink" ],
"body": "http://example.org/spec1",
"target": "http://example.org/entity1"
}, {
"@id": "_:contents",
"@type": [ "nif:Context", "nif:OffsetBasedString" ],
"nif:isString": "\n\n/\nRUTOSAN\n\nS P Ó tK A AKCYJNA\n\n38-500 SANOK, ul. Lipińskiego 109\nPrezes Spółki - teł. (0137) 50126, fox (0137) 50400\nDyrektor M arketingu - teł. (0137) 50282,\nDział Marketingu - tel. (0137) 50426, fax (0137) 50430\nDział Handlu i Planowania Produkcji - tel. (0137) 50253, fax (0137)50436\n\n- teiex 323577\n\nAUTOBUS PODMIEJSKI \nAUTOSAN H 10 -11.1 IN\n\nWymiary podstawowe:\n- długość całkowita 11 200 mm\n- szerokość - 2 500 mm\n- wysokość - 3 085 mm\n- rozstaw osi - 5 425 mm\n- wysokość stopnia - 380 mm\n- wysokość wnętrza - 2 025 mm\n\nMasy podstawowe:\n- masa własna - 10 500 kg\n- masa całkowita - 16 000 kg\n- dopuszczalna masa na oś przednią 6 000 kg\n- dopuszczalna masa na oś tylną - 10 000 kg\n\nNadwozie: - Konstrukcja nadwozia wykonana z rur stalowych kwadratowych i prostokątnych \nłączonych ze sobą za pomocą spawania.\n\n- ilość miejsc pasażerskich\nwózki inwalidzkie / siedzące + stojące - 2 / 3 5 + 30\n\n- rodzaj foteli - niskie sztywne\n- ilość drzwi pasażerskich - 2 + 1\n- sterowanie drzwi - przednie i tylne jednoskrzydłowe\n\nsterowane pneumatycznie ze stanowiska \nkierowcy, drzwi środkowe dwuskrzydłowe \nsterowane pneumatycznie ze stanowiska \nkierowcy\n\n\n\n\n- ogrzewanie - agregat wodny (wydajność 22 000 kcal / h)\n+ nagrzewnice\n\n- lusterka zewnętrzne - podgrzewane elektrycznie\n- wentylacja - dynamiczna za pomocą otworów\n\nwentylacyjnych w ścianie przodu, \nklapą w dachu, przez okna\n\n- wykończenie wnętrza:\nsufit i ściany boczne - pokryte płytami laminowanymi\npodłoga - pokryta wykładziną PVC typ RONDO\nściana przodu - elementy laminowane (tworzywowe i skóropodobne)\n\n- wyposażenie dodatkowe: - urządzenie do wnoszenia wózków inwalidzkich,\nsterowanie urządzenia elektryczne po otwarciu \ndrzwi środkowych\n\nPodwozie: - rama podłużnicowo - kratownicowa, współpracująca z nadwoziem\n- rozstaw kół przednich - 2 007 mm\n- rozstaw kół tylnych - 1 820 mm\n- umieszczenie zespołu napędowego - podłużnie, za tylnymi kołami\n- 
typ silnika - SWT 11/311/2\n- moc silnika - 176 kW przy 2200 obr/min\n- moment obrotowy - 902 Nm przy 1500 obr/min\n- pojemność skokowa -1 1 ,1 dm3\n- sprzęgło - jednotarczowe, suche, sterowanie wspomagane\n- typ skrzyni biegów - S 6 - 9 0 /? 0 3 > /q ? ó\n\n- przednia oś - sztywna, produkcji RABA\n- most napędowy - sztywny, jednostopniowy, produkcji RABA\n- przełożenie mostu - 5,37\n- zwieszenie - resory stalowe, miechy pnneumatyczne,\n\n- hamulce:\n- roboczy\n\namortyzatory teleskopowe,stabilizatory przechyłu \n\n- bębnowy, sterów, pneumatycznie, dwuobwodowy\n- awaryjny - sprężynowy, sterów, pneumatycznie\n\n- typ mechanizmu kierowniczego - 8065 licencja ZF\n- ogumienie - 11R 22,5\n- instalacja elektyryczna - 24 V\n- akumulatory - 2 x 180 Ah, 12V\n- alternator - 85 A, 2 8 V\n- rozrusznik - 4,4 kW, 24V\n- pojemność zbiornika paliwa - 185 dm3\n- prędkość maksymalna - 120 km/h\n- zużycie paliwa przy prędkości 70 km/h \n\n- pusty / obciążony - 25 / 27 dm3 /100 km\n\nZastrzega się możliwość zmiany parametrów technicznych ze względu na ciągłą \nmodernizację wyrobu.\n\n\n"
}]
}
Easily convertible to any RDF format: https://s.zazuko.com/3eUvtdZ
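For illustration, the shape of a single exported annotation can be sketched in a few lines of Python. This is only a sketch of the data structure used in the example above, not anything INCEpTION provides: the function names `span_to_web_annotation` and `to_document` are hypothetical, and the layer-specific types and extra contexts are left out for brevity.

```python
import json

# The standard Web Annotation JSON-LD context, as used in the example above.
ANNO_CONTEXT = "https://www.w3.org/ns/anno.jsonld"


def span_to_web_annotation(anno_id, body_iri, source, start, end):
    """Build one Web Annotation (as a plain dict) for a character span.

    `start` is inclusive and `end` is exclusive, both 0-based, which is
    how TextPositionSelector defines them.
    """
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")
    return {
        "id": anno_id,
        "type": "Annotation",
        "body": {"id": body_iri},
        "target": {
            "source": source,
            "selector": {
                "type": "TextPositionSelector",
                "start": start,
                "end": end,
            },
        },
    }


def to_document(annotations):
    """Wrap annotations in a JSON-LD document with the anno context."""
    return {"@context": ANNO_CONTEXT, "@graph": list(annotations)}


# Values taken from the first annotation in the example above.
entity = span_to_web_annotation(
    "http://example.org/entity1",
    "https://new.wikibus.org/vehicle/autosan/h10",
    "_:contents",
    340,
    353,
)
doc = to_document([entity])
print(json.dumps(doc, indent=2))
```

Because the result is an ordinary dict, extra contexts (such as the nif and schema prefixes) could be appended to `@context` as a list without changing the annotation-building code.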
Once loaded into memory or a triple store, I could easily run a SPARQL query that converts these annotations to my desired data model:
PREFIX schema: <http://schema.org/>
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
construct {
?target ?prop
[
a schema:QuantitativeValue ;
schema:value ?value ;
schema:unitCode ?unitCode ;
] .
} where {
?annotation a <urn:inception:annotatation#SpecLink> .
?annotation oa:hasTarget/oa:hasBody ?target .
?annotation oa:hasBody/oa:hasBody
[
schema:unitCode ?unitCode ;
schema:propertyID ?prop ;
] .
?annotation oa:hasBody/oa:hasTarget/oa:hasSource/nif:isString ?pdfContents .
?annotation oa:hasBody/oa:hasTarget/oa:hasSelector
[
oa:start ?start ;
oa:end ?end ;
] .
# SPARQL SUBSTR is 1-based, while TextPositionSelector offsets are 0-based
BIND(SUBSTR(?pdfContents, ?start + 1, ?end - ?start) AS ?value) .
}
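The offset arithmetic in the BIND can be sanity-checked outside SPARQL. The sketch below resolves a TextPositionSelector against a string in Python and also computes the arguments a SPARQL `SUBSTR` call would need. The helper names and the sample text are made up for illustration. One detail worth noting: TextPositionSelector counts from 0 with an exclusive `end`, whereas SPARQL's `SUBSTR` counts from 1 and takes a length.

```python
def covered_text(text, selector):
    """Return the text a TextPositionSelector points at.

    Offsets are 0-based; `start` is inclusive and `end` is exclusive,
    which matches Python slicing directly.
    """
    start, end = selector["start"], selector["end"]
    if not 0 <= start <= end <= len(text):
        raise ValueError("selector is out of range for this text")
    return text[start:end]


def sparql_substr_args(selector):
    """Return (startingLoc, length) for SPARQL SUBSTR, which is 1-based."""
    start, end = selector["start"], selector["end"]
    return start + 1, end - start


# Made-up sample line, loosely modelled on the spec sheet above.
sample = "- length - 11 200 mm"
selector = {"type": "TextPositionSelector", "start": 11, "end": 17}

value = covered_text(sample, selector)
substr_args = sparql_substr_args(selector)
print(repr(value), substr_args)
```

Python indexes strings by code point. Offsets exported from a Java-based tool may count UTF-16 code units instead, which would differ for characters outside the Basic Multilingual Plane, so this is worth verifying against real exports.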
What is your goal?
- Do you want to represent the annotations themselves as RDF and if so why?
- Or you want to represent the knowledge derived from the annotations as RDF?
If you could generate the format you want directly from the annotations using a Python script, why go through an intermediate RDF-based format?
Of course, the ultimate goal is to construct knowledge from the annotations.
That said, representing the annotations themselves as RDF has some advantages IMO. First, such a representation would be platform agnostic, more so than any of the supported formats, because all popular languages have RDF and SPARQL capabilities. Thus, knowledge can be derived more naturally, without additional dependencies on more libraries in more languages. Second, when SPARQL is used for the transformation, the advantage is working in a declarative language. Given the choice between an imperative script (in Python or otherwise) and a declarative transformation, I will usually choose the latter. And again, it may be more natural for any developer to produce RDF from RDF than from something else, given the opportunity.
> If you could generate the format you want directly from the annotations using a Python script, why go through an intermediate RDF-based format?
To turn your remark around: if I could generate RDF directly from RDF, why go through an intermediate non-RDF format? :)
RDF is not a suitable format for actively working with linguistic annotation. INCEpTION internally uses the UIMA CAS representation because it allows efficient "navigation" between annotations and "reasoning" over span locations such as checking if an annotation is covered by another annotation. This is done through a special kind of index in the UIMA CAS model which allows doing such things efficiently without having to explicitly represent in statements that an annotation is left of another one or covered by another one.
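To make the idea concrete, here is a minimal Python sketch of an offset-sorted span index that answers "which annotations are covered by this one?" from begin/end offsets alone, without any explicit "covered by" statements. It is not UIMA's actual index implementation. The `Span` and `SpanIndex` names are hypothetical, and the sentence span is invented. Only the two entity offsets are borrowed from the Turtle example further down.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


class SpanIndex:
    """Spans kept sorted by begin offset so range lookups can use bisect."""

    def __init__(self, spans):
        # Sort by begin ascending, then by end descending (longer spans first).
        self._spans = sorted(spans, key=lambda s: (s.begin, -s.end))
        self._begins = [s.begin for s in self._spans]

    def covered_by(self, outer):
        """Return all spans lying fully inside `outer`, excluding `outer`."""
        # Only spans that begin inside [outer.begin, outer.end] can qualify.
        lo = bisect_left(self._begins, outer.begin)
        hi = bisect_right(self._begins, outer.end)
        return [
            s for s in self._spans[lo:hi]
            if s.end <= outer.end and s is not outer
        ]


sentence = Span(150, 200, "Sentence")  # invented for the example
spans = [
    sentence,
    Span(159, 175, "Entity"),
    Span(179, 184, "Entity"),
    Span(210, 220, "Entity"),  # lies outside the sentence
]
index = SpanIndex(spans)
inside = index.covered_by(sentence)
print(inside)
```

The binary search narrows the candidates by begin offset before the end offsets are checked, which is the kind of positional lookup that would otherwise have to be spelled out as explicit triples in RDF.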
NIF tries to map such capabilities into RDF so that you can do similar stuff using SPARQL, but IMHO it fails. In fact, I believe that SPARQL/RDF is in general not a good way to approach annotations in NLP.
The Web Annotation standard has a few upsides when it comes to representing ways of anchoring an annotation to an object. However, it does not standardize the body of the annotation in any way. Thus IMHO it fails to provide proper interoperability between different systems that support Web Annotation. While the systems might be able to map the annotation to the annotated object, the interpretation of the body is likely to differ from one system to another, unless the body is just a trivial text comment. Considering where Web Annotation came from, body-is-a-text may have been the prevalent use case, but tools like INCEpTION operate with structured annotations.
How about something like this?
<file:/Obama.txt#6429>
rdf:type rdfcas:FeatureStructure , <uima:webanno.custom.Entity> ;
rdfcas:indexedIn <file:/Obama.txt#1> ;
cas:AnnotationBase-sofa <file:/Obama.txt#1> ;
tcas:Annotation-begin "159"^^xsd:int ;
tcas:Annotation-end "175"^^xsd:int ;
<uima:webanno.custom.Entity-iri>
"http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448306" .
<file:/Obama.txt#6434>
rdf:type rdfcas:FeatureStructure , <uima:webanno.custom.Entity> ;
rdfcas:indexedIn <file:/Obama.txt#1> ;
cas:AnnotationBase-sofa <file:/Obama.txt#1> ;
tcas:Annotation-begin "179"^^xsd:int ;
tcas:Annotation-end "184"^^xsd:int ;
<uima:webanno.custom.Entity-iri>
"http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448281" .
<file:/Obama.txt#6439>
rdf:type rdfcas:FeatureStructure , <uima:webanno.custom.Relation> ;
rdfcas:indexedIn <file:/Obama.txt#1> ;
cas:AnnotationBase-sofa <file:/Obama.txt#1> ;
tcas:Annotation-begin "159"^^xsd:int ;
tcas:Annotation-end "175"^^xsd:int ;
<uima:webanno.custom.Relation-Dependent>
<file:/Obama.txt#6429> ;
<uima:webanno.custom.Relation-Governor>
<file:/Obama.txt#6434> ;
<uima:webanno.custom.Relation-iri>
"http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448284" .
Thank you for a comprehensive commentary on NIF and Web Annotation Model.
When looking at the latter, it did feel like its use case is slightly different from what INCEpTION does. I chose it simply because it is the closest related standard that can mostly be adapted to my needs. It also focuses on the annotations themselves and not on the NLP model, which I largely ignore. Thus, I do acknowledge that introducing it to INCEpTION may be an anti-goal.
A simple RDF binding to the CAS format itself would totally work too. Did you have in mind embedding it in the NIF export? In other words, exporting only the annotations from the CAS as RDF, not the tokens and all?
At this moment I'm reconsidering my earlier approach, in which I would simply convert CAS JSON to RDF using a JSON-LD context. The only issue there is that it uses numeric references but it's a simple enough JSON rewrite in a few places and I can easily get queryable RDF out of it by applying a JSON-LD context.
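A rough sketch of that rewrite in Python follows. The input shape is an assumption rather than a description of the real export: it supposes each feature structure is a JSON object with a numeric `%ID`, a `%TYPE`, and reference features whose keys start with `@` and whose values are numeric IDs. The function name `rewrite_refs` is hypothetical, and the base IRI and IDs are borrowed from the Turtle example above.

```python
def rewrite_refs(feature_structures, base_iri):
    """Turn numeric IDs and references into JSON-LD style IRI nodes.

    ASSUMPTION: "%ID" holds the numeric identifier, "%TYPE" the type name,
    and keys starting with "@" hold numeric references to other entries.
    """
    rewritten = []
    for fs in feature_structures:
        node = {}
        for key, value in fs.items():
            is_number = isinstance(value, int) and not isinstance(value, bool)
            if key == "%ID":
                node["@id"] = f"{base_iri}{value}"
            elif key == "%TYPE":
                node["@type"] = value
            elif key.startswith("@") and is_number:
                # Numeric reference: point at the referenced node by IRI.
                node[key[1:]] = {"@id": f"{base_iri}{value}"}
            else:
                node[key] = value
        rewritten.append(node)
    return rewritten


# Hypothetical input mirroring the relation from the Turtle example.
cas_like = [
    {"%ID": 6429, "%TYPE": "webanno.custom.Entity", "begin": 159, "end": 175},
    {"%ID": 6434, "%TYPE": "webanno.custom.Entity", "begin": 179, "end": 184},
    {
        "%ID": 6439,
        "%TYPE": "webanno.custom.Relation",
        "@Dependent": 6429,
        "@Governor": 6434,
    },
]
graph = rewrite_refs(cas_like, "file:/Obama.txt#")
print(graph[2])
```

The output would still need a JSON-LD context (for example one with an `@vocab`) before the type and feature names resolve to IRIs. The rewrite only removes the numeric-reference obstacle.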
Unless you feel otherwise, I think this could be closed, and I will continue in the linked discussion.
Closing this in favor of the discussion.
If anybody encounters this in the future, finds the idea of supporting Web Annotation in INCEpTION useful and has anything to contribute to the discussion, feel free to post here or open a new feature request with a fresh perspective.