rdflib.js

Handle Atom XML feeds

Open timbl opened this issue 5 years ago • 7 comments

Example: http://dbpedia.org/data/Massachusetts_Institute_of_Technology.atom

produces from rdflib:

Error: Fetcher: Unsupported dialect of XML: not RDF or XHTML namespace, etc. <?xml version="1.0" encoding="utf-8" ?> <feed xmlns="http://www.w3.org/2005/A 

at Fetcher.failFetch 

 at XMLHandler.parse (https://timbl.com/timbl/Automation/Library/Mashup/mashlib.js:9994:22) at https://timbl.com/timbl/Automation/Library/Mashup/mashlib.js:11543:24 at async Promise.all (index 8)
store = context.session.store

The XML resource starts:

<?xml version="1.0" encoding="utf-8" ?>
<feed 
	 xmlns="http://www.w3.org/2005/Atom" 
	 xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" 
	 xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" 
	 xmlns:georss="http://www.georss.org/georss/"
	 xmlns:dbp="http://dbpedia.org/property/"
	 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
	 xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	 xmlns:foaf="http://xmlns.com/foaf/0.1/"
	 xmlns:dbo="http://dbpedia.org/ontology/"
>
	<id>http://dbpedia.org:8891/data/Massachusetts_Institute_of_Technology.atom</id>
	<updated>2020-12-14T17:20:55.266081Z</updated>
	<author><name /></author>
	<title type="text">OData Service and Descriptor Document</title>
	<entry>
		<id>http://dbpedia.org/resource/Massachusetts_Institute_of_Technology</id>
		<link rel="http://www.w3.org/2002/07/owl#sameAs" href="http://es.dbpedia.org/resource/Instituto_Tecnológico_de_Massachusetts"/>
		<link rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" href="http://dbpedia.org/class/yago/Institution108053576"/>
		<link rel="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" href="http://www.wikidata.org/entity/Q3918"/>
		<link rel="http://purl.org/linguistics/gold/hypernym" href="http://dbpedia.org/resource/University"/>

timbl avatar Dec 14 '20 16:12 timbl

I'm getting the same issue with all EU controlled vocabularies, such as https://publications.europa.eu/resource/authority/distribution-status, which starts like this:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF

jeff-zucker avatar Nov 30 '24 04:11 jeff-zucker

Changing https://github.com/linkeddata/rdflib.js/blob/82f693dc10d5b915f97eedeab5351cf831c3611e/src/fetcher.ts#L373 to

 if (_ns && _ns === _ns['rdf'] || _ns.match(/(feed|rdf)/)) { 

solves the problem but I don't know what unintended consequences it would have.
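
One way to reason about possible unintended consequences is to compare the loose regular-expression test with an exact comparison against the RDF namespace URI. The sketch below assumes that `_ns` holds the root element's `namespaceURI` (a string, or null when the element has no namespace); under that assumption the expression as written would also call `.match` on null and throw for a root element without a namespace. The function names and the `example.org` namespace are hypothetical and are not part of rdflib.js.

```typescript
// The RDF namespace URI used by RDF/XML documents.
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

// Loose test, mirroring the regular expression in the proposed change,
// but guarded so that a null namespace does not throw.
function looksLikeRdfLoose(ns: string | null): boolean {
  return ns !== null && /(feed|rdf)/.test(ns)
}

// Strict test: accept only the exact RDF namespace URI.
function isRdfNamespace(ns: string | null): boolean {
  return ns === RDF_NS
}

const samples: Array<string | null> = [
  RDF_NS,
  'http://www.w3.org/2005/Atom',
  'http://example.org/feeds/v1', // hypothetical, unrelated namespace
  null,
]

for (const ns of samples) {
  console.log(ns, 'loose:', looksLikeRdfLoose(ns), 'strict:', isRdfNamespace(ns))
}
```

Any namespace URI that merely contains the substring "feed" or "rdf" passes the loose test, so documents that are not RDF/XML could be handed to the RDF/XML parser.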

jeff-zucker avatar Nov 30 '24 05:11 jeff-zucker

http://dbpedia.org/data/Massachusetts_Institute_of_Technology.atom has a representation available in application/atom+xml (with valid Atom payload). It can be parsed as Atom but not RDF/XML.

https://publications.europa.eu/resource/authority/distribution-status has a representation available in text/xml (with valid RDF/XML payload). It can be parsed as RDF/XML but not Atom.

csarven avatar Nov 30 '24 08:11 csarven

https://publications.europa.eu/resource/authority/distribution-status has a representation available in text/xml (with valid RDF/XML payload). It can be parsed as RDF/XML

Parsed by what software? Not by rdflib, as shown above.

jeff-zucker avatar Nov 30 '24 16:11 jeff-zucker

Let me clarify. Strictly speaking, representations in application/rdf+xml media type can be parsed as RDF/XML ( https://www.w3.org/TR/rdf-syntax-grammar/#section-MIME-Type ). This is not true for application/atom+xml or text/xml because they are structurally different. They are all part of the XML family but that's about it.

So, while rdflib.js could process representations with application/atom+xml or text/xml media types through an RDF/XML parser, the structure may not conform to RDF/XML, in which case an error will be thrown because the input is invalid. If the proposal is that rdflib.js should try anyway (this issue?), it must take that into account. Put differently, it means separating how the payload should be processed according to the intended media type (from the HTTP headers) from how the parser can be applied to any string.

Some application/atom+xml or text/xml representations in the wild may include RDF/XML (as per the examples), so processing could start from the root or from a particular subtree, as decided by rdflib.js or by higher-level software using rdflib.js that passes a parameter ("yes, I know this is not RDF/XML, but try to parse it anyway because there may be a subtree with valid RDF/XML").
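
As a rough illustration of that separation, here is a minimal sketch in which the media type decides the default handling and RDF/XML parsing of other XML media types happens only when the caller opts in. The `chooseParser` function, the `ParserChoice` labels and the `forceRdfXml` flag are all hypothetical names made up for this sketch; none of them is an existing rdflib.js API.

```typescript
// Labels for the handling a caller might pick; hypothetical, not rdflib.js names.
type ParserChoice = 'rdfxml' | 'atom' | 'xml' | 'unsupported'

interface ParseOptions {
  // Hypothetical opt-in flag: "I know this is not served as RDF/XML, try anyway".
  forceRdfXml?: boolean
}

function chooseParser(contentType: string, options: ParseOptions = {}): ParserChoice {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const [first = ''] = contentType.split(';')
  const mediaType = first.trim().toLowerCase()

  // The authoritative media type for RDF/XML needs no opt-in.
  if (mediaType === 'application/rdf+xml') return 'rdfxml'

  const isAtom = mediaType === 'application/atom+xml'
  const isGenericXml = mediaType === 'text/xml' || mediaType === 'application/xml'
  if (!isAtom && !isGenericXml) return 'unsupported'

  // For other XML media types, use the RDF/XML parser only on explicit request.
  if (options.forceRdfXml === true) return 'rdfxml'
  return isAtom ? 'atom' : 'xml'
}

console.log(chooseParser('application/atom+xml'))
console.log(chooseParser('text/xml; charset=utf-8', { forceRdfXml: true }))
```

With this shape, the default path respects the Content-Type header, and the caller who forces RDF/XML parsing is also the one who should expect and handle parse errors.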

I want to also point out that the examples http://dbpedia.org/data/Massachusetts_Institute_of_Technology.atom and https://publications.europa.eu/resource/authority/distribution-status are technically different in that the former is about handling Atom (this issue) and the latter is about handling XML (which could be a separate issue since it is not related to the Atom case).

Ultimately the parser is going to treat the Atom as "soup" and swallow the errors, because there may be things that an RDF/XML parser does not recognise, and glean whatever RDF statements it can. Ditto for parsing any random XML document.

As for the specific example served with the text/xml media type, it can be passed through an RDF/XML parser because it seems to be an ordinary RDF/XML document, but rdflib.js wouldn't have that knowledge, whereas higher-level software using it may. I don't have strong opinions on this. Just laying it out.

I would say that the text/xml example is a bad server configuration. In this particular case it is not horrible, since it is still XML, but it does not fully respect the web architecture, which calls for using the intended media type for the data. If rdflib.js were to ignore the server's authoritative metadata (text/xml), it would be guessing at the representation data when it proceeds with RDF/XML parsing. This might be acceptable in controlled environments, but generally rdflib.js shouldn't treat it as anything other than a plain XML document.

The application/atom+xml example is a good server configuration. The representation should be parsed as Atom. rdflib.js would be going against the web architecture if it parsed it only as RDF/XML. rdflib.js shouldn't parse it as RDF/XML without the user's consent (or put differently, without the user/developer explicitly forcing rdflib.js).

Does that help clarify?


As for your question, some software does indeed parse the text/xml representation as RDF/XML. You'll have to dig into its libraries and design decisions; there may be historical reasons for it. For example, software like Raptor will treat that example Atom resource as rss-tag-soup and find RDF statements.

Example software parsing:

  • https://www.w3.org/RDF/Validator/rdfval?URI=https%3A%2F%2Fpublications.europa.eu%2Fresource%2Fauthority%2Fdistribution-status&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED

  • http://rdf.greggkellogg.net/distiller?command=serialize&url=https:%2F%2Fpublications.europa.eu%2Fresource%2Fauthority%2Fdistribution-status

  • https://librdf.org/raptor/

  • https://github.com/rdf-ext-archive/rdf-parser-rdfxml

Example software not parsing:

  • https://sparql.org/sparql?query=SELECT+*%0D%0AFROM+%3Chttps%3A%2F%2Fpublications.europa.eu%2Fresource%2Fauthority%2Fdistribution-status%3E%0D%0AWHERE+%7B+%3Fs+%3Fp+%3Fo+%7D&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl

  • https://rdf-play.rubensworks.net/#url=https%3A%2F%2Fpublications.europa.eu%2Fresource%2Fauthority%2Fdistribution-status&proxy=https%3A%2F%2Fproxy.linkeddatafragments.org%2F

  • https://github.com/rdfjs-base/fetch

csarven avatar Dec 02 '24 12:12 csarven

@csarven - regardless of the fact that the EU server sends "text/xml" as the content-type, the document begins with

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF

Isn't this namespace declaration sufficient to indicate the document should be parsed as RDF? Is there any reason that rdflib, on finding text/xml as a content-type can't then check the namespace and decide that it is indeed parseable?

jeff-zucker avatar Dec 16 '24 19:12 jeff-zucker

There is an inconsistency between the author's and the server's intent. The media type provided in the Content-Type header is always considered authoritative as to how to interpret the representation. In this scenario, if rdflib.js invokes its RDF/XML parser, it is acting at its own discretion. By doing so, it is performing content sniffing to identify whether there is a mismatch between the authoritative metadata and the actual content.

To answer your question, I presume detecting the RDF namespace in standalone XML documents served as text/xml would be "sufficient" to invoke the RDF/XML parser. As mentioned earlier, that kind of silent error recovery, and how user consent is obtained, should also be documented and/or coded (flags). It does not imply interrupting the user's experience or other processing.
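
For illustration only, the sniffing step could look roughly like the sketch below: it finds the first start tag in the document and resolves the root element's namespace from the `xmlns` declarations on that tag. This is a simplified string-based approach written for this sketch (it ignores DOCTYPE internal subsets and `>` inside attribute values), whereas a real implementation would read `namespaceURI` from a parsed DOM. `sniffRootNamespace` is a hypothetical helper name, and the sample document is a made-up minimal RDF/XML root.

```typescript
// The RDF namespace URI used by RDF/XML documents.
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

// Return the namespace URI of the root element, or null if none is declared
// on the root start tag. Simplified: regular expressions, not a real XML parser.
function sniffRootNamespace(xml: string): string | null {
  // Remove a byte order mark, the XML declaration, processing instructions,
  // comments and a simple DOCTYPE so the first tag left is the root element.
  const withoutProlog = xml
    .replace(/^\uFEFF/, '')
    .replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->|<!DOCTYPE[^>]*>/g, '')

  const startTag = /<([^\s<>\/!?]+)([^>]*)>/.exec(withoutProlog)
  if (startTag === null) return null

  const qname = startTag[1] ?? ''
  const attrText = startTag[2] ?? ''

  // A prefixed root (rdf:RDF) needs xmlns:rdf; an unprefixed one (feed) needs xmlns.
  const colon = qname.indexOf(':')
  const wanted = colon === -1 ? 'xmlns' : 'xmlns:' + qname.slice(0, colon)

  const attrPattern = /([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g
  let m: RegExpExecArray | null
  while ((m = attrPattern.exec(attrText)) !== null) {
    if (m[1] === wanted) {
      const value = m[2] ?? m[3] ?? ''
      return value === '' ? null : value
    }
  }
  return null
}

// Made-up minimal document with an RDF/XML root element.
const sample = '<?xml version="1.0" encoding="utf-8" ?>\n' +
  '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"></rdf:RDF>'

console.log(sniffRootNamespace(sample) === RDF_NS)
```

A flag-controlled caller could run this check only for text/xml or application/xml responses and record that the parser was chosen by sniffing rather than by the Content-Type header.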

csarven avatar Dec 22 '24 13:12 csarven