cheminf.owl has invalid UTF-8 characters, will not load in Apache Jena

Open stevevestal opened this issue 4 years ago • 0 comments

The PubChemRDF vocabulary https://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary.owl lists as one of its imports http://semanticscience.org/ontology/cheminf.owl

When that import IRI is used as a URL, it redirects to https://raw.githubusercontent.com/semanticchemistry/semanticchemistry/master/ontology/cheminf.owl

When Apache Jena attempts to load the ontology from that URL, it causes a RiotException (invalid format):

[line: 1681, col: 145] Invalid byte 3 of 3-byte UTF-8 sequence.
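
To locate every offending byte sequence rather than only the first one Jena reports, a small scan like the following may help. This is a minimal sketch, not part of Jena: the function name `find_invalid_utf8` is hypothetical, and the columns it reports are byte offsets within the line, so they may not match Jena's column numbers exactly.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, byte_column, offending_bytes) for each invalid UTF-8 run."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no more invalid sequences.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = pos + err.end
            line = data.count(b"\n", 0, start) + 1
            line_start = data.rfind(b"\n", 0, start) + 1
            problems.append((line, start - line_start + 1, data[start:end]))
            pos = end  # resume scanning just past the bad bytes
    return problems


if __name__ == "__main__":
    # Assumes a local copy of the ontology saved as cheminf.owl.
    with open("cheminf.owl", "rb") as handle:
        for line, col, bad in find_invalid_utf8(handle.read()):
            print(f"line {line}, byte column {col}: {bad!r}")
```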

The self-declared IRI (ontology rdf:about) for the fetched document is yet a third thing, http://semanticchemistry.github.io/semanticchemistry/ontology/cheminf.owl. Fetching from that URL produces the same error. http://semanticchemistry.github.io does not have an issue tracker (a shout out to NIH PubChem support for pointing me to this tracker).

There are multiple occurrences of invalid UTF-8 characters. When I loaded the ontology into an Eclipse text editor and forced a save, it provided an option to save with all characters converted to UTF-8. This allowed the ontology to be loaded.
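
The same workaround can be scripted outside Eclipse. The sketch below keeps every valid UTF-8 sequence as-is and reinterprets only the invalid bytes in a fallback encoding. The fallback of Windows-1252 is purely an assumption (the actual source encoding of the stray bytes is not known), and the names `repair_to_utf8` and `cp1252_fallback` are hypothetical.

```python
import codecs


def _cp1252_fallback(err):
    """Decode only the offending bytes as Windows-1252 (an assumed encoding)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes undefined in cp1252 become U+FFFD instead of raising again.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as valid UTF-8, leaving valid sequences untouched."""
    text = data.decode("utf-8", errors="cp1252_fallback")
    return text.encode("utf-8")


if __name__ == "__main__":
    # Assumes a local copy saved as cheminf.owl; writes a repaired copy beside it.
    with open("cheminf.owl", "rb") as src:
        fixed = repair_to_utf8(src.read())
    with open("cheminf-utf8.owl", "wb") as dst:
        dst.write(fixed)
```

Since the replacement characters are a guess, it is worth diffing the repaired file against the original before loading it into Jena.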

stevevestal, Nov 24 '20 10:11