rdftab.rs

stacktrace when dealing with entity expansions

Open cmungall opened this issue 4 years ago • 4 comments

I get an error when loading ogg.owl.

(venv) ~/repos/semantic-sql(main) $ export RUST_BACKTRACE=full
(venv) ~/repos/semantic-sql(main) $ ./bin/rdftab db/ogg.db < owl/ogg.owl 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RdfXmlError { kind: Xml(EscapeError(UnrecognizedSymbol(1..4, Ok("obo")))) }', src/main.rs:57:5
stack backtrace:
   0:        0x109357e45 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h30b85a1761190f28
   1:        0x10937644e - core::fmt::write::h5b0722e6ee659e34
   2:        0x109356269 - std::io::Write::write_fmt::hf468289e762fa2f9
   3:        0x10935aa9a - std::panicking::default_hook::{{closure}}::h836d46ca6b872224
   4:        0x10935a7bf - std::panicking::default_hook::h2afcf1998cd93f8c
   5:        0x10935b0ed - std::panicking::rust_panic_with_hook::he4f5d8b43533efd5
   6:        0x10935ac82 - rust_begin_unwind
   7:        0x109378acf - core::panicking::panic_fmt::h3559129da805eab4
   8:        0x109378b45 - core::result::unwrap_failed::h170de03e7ee26a1a
   9:        0x1093423a5 - rdftab::main::h1bc34813cbf130e1
  10:        0x1093316a6 - std::rt::lang_start::{{closure}}::h63a82885a43041b4
  11:        0x10935ab58 - std::panicking::try::do_call::h29bd6a8b4eb65398
  12:        0x10936318b - __rust_maybe_catch_panic
  13:        0x10935e389 - std::rt::lang_start_internal::h1cbb853ed77189ce
  14:        0x109342559 - main

My prefix table looks normal:

sqlite> select * from prefix where prefix='obo';
obo|http://purl.obolibrary.org/obo/

This is just using the standard version of ogg:

curl -L -s http://purl.obolibrary.org/obo/ogg.owl

This seems to be caused by entity expansions.

<?xml version="1.0"?>


<!DOCTYPE rdf:RDF [
    <!ENTITY foaf "http://xmlns.com/foaf/0.1/" >
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY obo "http://purl.obolibrary.org/obo/" >
    <!ENTITY dc "http://purl.org/dc/elements/1.1/" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY iao "http://purl.obolibrary.org/obo/iao/" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY ncbitaxon "http://purl.obolibrary.org/obo/ncbitaxon#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
    <!ENTITY oboInOwl "http://www.geneontology.org/formats/oboInOwl#" >
    <!ENTITY protege "http://protege.stanford.edu/plugins/owl/protege#" >
]>


<rdf:RDF xmlns="&obo;ogg.owl#"
     xml:base="&obo;ogg.owl"
...
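To illustrate what the parser has to do with these declarations, here is a minimal sketch of internal-entity expansion in plain Rust. It is not rio's implementation: the function names `parse_entities` and `expand` are hypothetical, only double-quoted general entities are handled, and the predefined XML entities (such as `&amp;`) are ignored. The error in the panic message, UnrecognizedSymbol(1..4, Ok("obo")), is consistent with a parser that reaches `&obo;` in an attribute value without having recorded the DOCTYPE declarations, which is the `Err` branch below.

```rust
use std::collections::HashMap;

/// Collect `<!ENTITY name "value">` declarations from a DOCTYPE internal subset.
/// Simplified: assumes double-quoted values that contain no '>' character.
fn parse_entities(doctype: &str) -> HashMap<String, String> {
    let mut entities = HashMap::new();
    let mut rest = doctype;
    while let Some(start) = rest.find("<!ENTITY") {
        rest = &rest[start + "<!ENTITY".len()..];
        let end = match rest.find('>') {
            Some(i) => i,
            None => break, // malformed declaration: stop scanning
        };
        // The declaration body looks like: ` obo "http://purl.obolibrary.org/obo/" `
        let mut parts = rest[..end].trim().splitn(2, char::is_whitespace);
        if let (Some(name), Some(value)) = (parts.next(), parts.next()) {
            let value = value.trim().trim_matches('"');
            entities.insert(name.to_string(), value.to_string());
        }
        rest = &rest[end + 1..];
    }
    entities
}

/// Replace every `&name;` reference in `text` with its declared value.
/// Returns an error for a reference with no matching declaration.
fn expand(text: &str, entities: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after
            .find(';')
            .ok_or_else(|| "unterminated entity reference".to_string())?;
        let name = &after[..semi];
        match entities.get(name) {
            Some(value) => out.push_str(value),
            None => return Err(format!("unrecognized entity: {}", name)),
        }
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With the declarations above loaded, `expand("&obo;ogg.owl#", &entities)` yields the full `http://purl.obolibrary.org/obo/ogg.owl#` IRI; with an empty table the same call returns an error instead.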

Jena also has issues:

$ riot --out RDFXML owl/ogg.owl > owl/ogg-riot.owl
11:02:17 ERROR riot            :: [line: 1, col: 1 ] JAXP00010001: The parser has encountered more than "64000" entity expansions in this document; this is the limit imposed by the JDK.

I have dreadful memories of this kind of error from earlier experiences with RDF/XML, but haven't seen it for a while.

The workaround is to launder the file through robot:

robot convert -i owl/ogg.owl -o owl/ogg-robot.owl && mv owl/ogg-robot.owl owl/ogg.owl

More graceful handling upstream would be welcome, but not urgent as there is a workaround.
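As a rough sketch of what more graceful handling could look like on the calling side, the unwrap() at src/main.rs:57 could be replaced with a match that prints a readable message and returns a non-zero exit code. Everything here is hypothetical: `ParseError`, `parse_input`, and `report` are stand-ins written for this example, not rdftab or rio code.

```rust
use std::fmt;

/// Hypothetical stand-in for the parser error shown in the panic message.
#[derive(Debug)]
struct ParseError {
    entity: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unrecognized XML entity '&{};'", self.entity)
    }
}

/// Hypothetical stand-in for the parsing step: fails on the first `&name;`
/// reference, otherwise pretends one triple was loaded per input line.
fn parse_input(input: &str) -> Result<usize, ParseError> {
    if let Some(amp) = input.find('&') {
        let after = &input[amp + 1..];
        let name = after.split(';').next().unwrap_or("");
        return Err(ParseError { entity: name.to_string() });
    }
    Ok(input.lines().count())
}

/// Turn the result into an exit code instead of panicking with a backtrace.
fn report(result: Result<usize, ParseError>) -> i32 {
    match result {
        Ok(n) => {
            println!("loaded {} triples", n);
            0
        }
        Err(e) => {
            eprintln!("rdftab: failed to parse input: {}", e);
            1
        }
    }
}
```

A real main would pass the return value of `report` to `std::process::exit`, so the user sees a one-line message rather than a stack trace.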

cmungall • May 23 '21 18:05

This must be happening in rio. Hopefully there's a setting we can tweak, or just update the dependency, because I do not want to get into the core of this.

jamesaoverton • May 24 '21 12:05

@lmcmicu Can you check whether recent updates to rio_xml resolve this problem? This commit seems relevant: https://github.com/oxigraph/rio/commit/bb81f95d5cdf6dfcd278d92a2d51bf154a166fb5

jamesaoverton • Jun 21 '21 15:06

Will do.

lmcmicu • Jun 21 '21 15:06

Yes, it seems to work if we use the updated rio:

$ make build/ogg.db
rm -f build/ogg.db
sqlite3 build/ogg.db < build/prefix.sql
rdftab build/ogg.db < ogg.owl

If you would like to try it out by hacking the files on the master branch, then change the [dependencies] block in Cargo.toml to this one: https://github.com/ontodev/rdftab.rs/blob/271c36f3670fe1104c1b62da82f5538b2631e0c9/Cargo.toml#L7 and also comment out the [patch.crates-io] block: https://github.com/ontodev/rdftab.rs/blob/271c36f3670fe1104c1b62da82f5538b2631e0c9/Cargo.toml#L25

Then in src/main.rs you must add a use declaration for Iri: https://github.com/ontodev/rdftab.rs/blob/271c36f3670fe1104c1b62da82f5538b2631e0c9/src/main.rs#L8 and change the call to RdfXmlParser so that it looks like this: https://github.com/ontodev/rdftab.rs/blob/271c36f3670fe1104c1b62da82f5538b2631e0c9/src/main.rs#L925

Finally, rebuild with cargo build --release.

The permanent solution will involve merging this branch into master: https://github.com/ontodev/rio/tree/merge-upstream-ws-changes

lmcmicu • Jun 22 '21 15:06