Omitting xmlns in parsed literals with rdf:HTML datatype
When the parser encounters:
<p ... datatype="rdf:HTML"><span>foo</span></p>
The value comes out as
<span xmlns="http://www.w3.org/1999/xhtml"></span>
Is it possible to omit the xmlns in the output so that the value is "as is": <span>foo</span>?
We used the features option when initialising RdfaParser:
features: {
xmlnsPrefixMappings: false,
}
but that seemed to strip out all of the markup. We're not sure that flag was the right one to begin with, and passing features this way also appears to override all of the default settings.
So:
- Is it possible to omit the xmlns namespace? If not, is this something that can only be done post-parsing?
- Is it possible to override one of the features?
Is it possible to omit the xmlns namespace? If not, is this something that can only be done post-parsing?
AFAIK, there's no way to do this at the moment. We would need some changes in the code before this can be done.
Is it possible to override one of the features?
The approach you follow with defining options should be correct.
features: {
xmlnsPrefixMappings: false,
}
should only set a value for xmlnsPrefixMappings, but keep all other options to their default.
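For illustration, the intended behaviour (one flag overridden, everything else left at its default) corresponds to a shallow merge. If the parser turns out to replace the whole feature set instead, as the report above suggests, the caller could do that merge before passing features. This is only a sketch: defaultFeatures and mergeFeatures are hypothetical names, and every feature name other than xmlnsPrefixMappings is a placeholder rather than the library's actual list.

```typescript
type Features = Record<string, boolean>;

// Hypothetical defaults for illustration only; not the library's real feature set.
const defaultFeatures: Features = {
  xmlnsPrefixMappings: true,
  htmlDatatype: true,
  langAttribute: true,
};

// Shallow merge: keys in `overrides` win, all other defaults are kept.
function mergeFeatures(defaults: Features, overrides: Features): Features {
  return { ...defaults, ...overrides };
}

const merged = mergeFeatures(defaultFeatures, { xmlnsPrefixMappings: false });

// Only the overridden flag changes; the remaining flags keep their defaults.
console.log(merged);
```

The merged object could then be passed as the features option, assuming the parser accepts a complete feature set there.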
We're additionally having an issue where we get unexpected output when parsing HTML. The input is:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
</head>
<body about="" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# schema: http://schema.org/">
<main>
<article>
<div datatype="rdf:HTML" id="content" property="schema:description">
<p>foo</p>
<div rel="schema:hasPart" resource="#bar">
<p property="rdf:HTML">bar</p>
</div>
</div>
</article>
</main>
</body>
</html>
So we're expecting the description value to have:
<p>foo</p>
<div rel="schema:hasPart" resource="#bar">
<p property="rdf:HTML">bar</p>
</div>
But instead the parser is giving us:
<p xmlns="http://www.w3.org/1999/xhtml" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:schema="http://schema.org/">foo</p>
<div rel="schema:hasPart" resource="#bar" xmlns="http://www.w3.org/1999/xhtml" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:schema="http://schema.org/">
</div>
So there's a lot of content that is being stripped out.
Is there some config or option that we're overlooking?
Slight correction on the example input above but the issue is the same.
Input:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
</head>
<body about="" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# schema: http://schema.org/">
<main>
<article>
<div datatype="rdf:HTML" id="content" property="schema:description">
<p>foo</p>
<div rel="schema:hasPart" resource="#bar">
<p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>
</div>
</div>
</article>
</main>
</body>
</html>
Output:
<https://dokie.li/tmp/test.html#bar> <http://schema.org/description> "<span xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/description> "\n <p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">foo</p>\n <div rel=\"schema:hasPart\" resource=\"#bar\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">\n \n </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/hasPart> <https://dokie.li/tmp/test.html#bar> .
Expected (from http://rdf.greggkellogg.net/distiller):
<http://example.org/> <http://schema.org/description> "\n <p>foo</p>\n <div rel=\"schema:hasPart\" resource=\"#bar\">\n <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<http://example.org/> <http://schema.org/hasPart> <http://example.org/#bar> .
<http://example.org/#bar> <http://schema.org/description> "<span>bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
Note the missing markup and content inside of the div (\n <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n )
Is this a bug in rdf-ext / rdfa-streaming-parser, or does the issue perhaps lie on our end somehow? It'd be great if you could reproduce / confirm.
All of the namespaces from the body's prefixes are being inherited by every tag. At first we wanted to post-process just the xmlns, but that's actually just one among many. I understand that the rdf:HTML literal may be more well-formed (as XML) that way, but I'm not sure if that's a spec requirement. (The distiller doesn't seem to pass the namespaces down, but then again, perhaps that's wrong and rdfa-streaming-parser is correct.)
The spec is not really explicit about this behaviour of namespaces being included (AFAICS), but the RDFa spec tests seem to require it, which is why I implemented it like this.
I understand the need for disabling this behaviour though, so a config option would be good to have indeed.
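Until such an option exists, the post-processing route mentioned earlier could be sketched roughly as follows. This is not part of rdfa-streaming-parser: stripNamespaceDeclarations is a hypothetical helper, and it assumes the literal value is serialized with double-quoted attributes, as in the output shown above. Being regex-based, it would also remove matching text that happens to appear outside of a tag, so a proper HTML parser would be safer for untrusted input.

```typescript
// Matches a whitespace-prefixed xmlns="..." or xmlns:prefix="..." attribute
// whose value is wrapped in double quotes.
const XMLNS_ATTRIBUTE = /\s+xmlns(?::[A-Za-z_][\w.-]*)?="[^"]*"/g;

// Hypothetical helper: removes namespace declarations from a serialized
// rdf:HTML literal value and leaves all other attributes untouched.
function stripNamespaceDeclarations(html: string): string {
  return html.replace(XMLNS_ATTRIBUTE, "");
}

const parsed =
  '<span xmlns="http://www.w3.org/1999/xhtml" ' +
  'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" ' +
  'xmlns:schema="http://schema.org/">bar</span>';

console.log(stripNamespaceDeclarations(parsed)); // <span>bar</span>
```

This only addresses the inherited namespaces; it cannot restore markup that the parser has already dropped from the literal.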
Aside from the namespaces, what is more concerning is the missing markup and content. Would you be able to confirm whether this is a bug in the parser?
Would you be able to confirm whether this is a bug in the parser?
That seems to be a bug indeed. Might be good to create a separate issue for that so we definitely track it as well.
Probably related to the resource="#bar" there.
Thanks! I created https://github.com/rubensworks/rdfa-streaming-parser.js/issues/67 .