hdt-cpp
rdf2hdt parse error (although the file parses with serdi)
When I tried to create an HDT file from the latest Wikidata dump, I stumbled over the following triple:
----------- oneliner.nt ------------
<http://www.wikidata.org/reference/250da9edffc9625b588245400ab612129878c232> <http://www.wikidata.org/prop/reference/P854> <www.stat.gov.pl/broker/access/performSearch.jspa?searchString=Janowo&level=miejsc&wojewodztwo=2222&powiat=6381&gmina=&miejscowosc=&advanced=true> .
When I run rdf2hdt on it, I get the following:
$ rdf2hdt oneliner.nt onliner.hdt
error: oneliner.nt:1:140: bad IRI scheme char `2F'
Catch exception load: Error parsing input.
ERROR: Error parsing input.
even though serd seems to accept it:
$ serdi oneliner.nt
works...
Any ideas where in the code I could try looking for a solution?
Thanks, Axel
As a quick fix: is there any option in the nt parser that could be used to just skip to the next line and ignore parse errors per line for nt input? If not, where and how could I add such an option?
Hi Axel, could you be using an older version of Serd than the one that was compiled with HDT? The correct parsing of the IRI scheme component was added relatively recently to Serd. What's your output for serdi -v?
Notice that there is an option in Serd/Serdi to use lax parsing (-l), but it is probably not exposed through the HDT library ATM. Still, you can try lax parsing to a temporary file, and generate an HDT out of that temporary file as a workaround.
(And don't forget to email the Wikidata maintainers to explain to them that publishing an absolute IRI with no valid scheme component is not the Pedantic Way.)
Hi Axel, could you be using an older version of Serd than the one that was compiled with HDT? The correct parsing of the IRI scheme component was added relatively recently to Serd. What's your output for serdi -v?
0.28
On 22.07.2019, at 16:44, Wouter Beek [email protected] wrote:
Notice that there is an option in Serd/Serdi to use lax parsing (-l), but it is probably not exposed through the HDT library ATM.
That could be a workaround. It would be nice to expose this in rdf2hdt as well if it works; any idea where I need to start looking to add that?
Still, you can try lax parsing to a temporary file, and generate an HDT out of that temporary file as a workaround.
... hmmm, I thought I tried that, but need to check again.
(And don't forget to email the Wikidata maintainers to explain to them that publishing an absolute IRI with no valid scheme component is not the Pedantic Way.)
good point...
Axel
P.S.: As Wouter suspected, lax parsing remedies this issue (just tested on a local machine with serdi 0.30), so I'd seriously opt for adding the lax parsing option to rdf2hdt.
Looking at http://drobilla.net/docs/serd/, which says that non-strict parsing is the default, I am a bit confused now, as I cannot find a call to serd_reader_set_strict( ... ) anywhere in the code.
Hmmm, any help/hints welcome. I have to admit I don't really understand the serd interface and how it is called by hdt. I suspect it happens within libhdt/src/rdf/RDFParserSerd.cpp, but even if I add serd_reader_set_strict( reader, false ); there, it doesn't change anything. Or does that interfere with the call to serd_reader_set_error_sink?
The character is the /, since that "URI" does indeed have no scheme (so this isn't valid NTriples), but I'm not sure why you would be seeing different behaviour here. I get a failure with serdi (current master), lax or not:
$ serdi -l ./test.nt
error: ./test.nt:1:140: bad IRI scheme char `2F'
Note that lax parsing is not a free lunch: it can drop triples. So enabling it by default might not be the best idea. Surely the web has suffered enough under that philosophy? :)
FWIW, I wasn't arguing for enabling lax parsing by default, but it still might be worthwhile to have the option.
(Anyway, I managed to create a new Wikidata HDT dump in the meantime, but am still looking for somewhere to host it (88GB HDT) ;-))
On 01.08.2019, at 21:23, David Robillard [email protected] wrote:
Note that lax parsing is not a free lunch: it can drop triples. So enabling it by default might not be the best idea. Surely the web has suffered enough under that philosophy? :)
So after a short talk with @AxelPolleres, I was curious and gave the oneliner.ttl a try with serd 0.30.1. I just want to confirm that
- the lax parsing from 0.30 on solves the issue
- that it is enabled by default, probably because the option wasn't there before and serd enables it by default.
So how to proceed?
If we go for strict by default, we need to add serd_reader_set_strict( reader, true ); in libhdt/src/rdf/RDFParserSerd.cpp for version >= 0.30 only.
However, what I don't understand is why I don't need the -l when using serdi: it parses just fine, without any bad IRI scheme char `2F' error (which will give you trouble in HDT afterwards anyway). It also seems that lax parsing has been around for much longer (~0.21.1 -> https://github.com/drobilla/serd/commit/d51be9b8d97791bff796d046d10fe16fd4e41311).
So it seems there are two things going on here:
- the serd_reader_set_strict added in https://github.com/drobilla/serd/commit/d51be9b8d97791bff796d046d10fe16fd4e41311 protects against invalid characters, which is not what causes the error in oneliner.ttl, and that's why it doesn't change anything
- between 0.28 and 0.30, there was a change that allowed URIs without protocols to be parsed as being correct.
@drobilla could you provide some insight here?
@mielvds I think the discrepancy is because you are parsing it as .ttl there (serdi deduces the type from the extension if you don't provide it explicitly). This is different for Turtle and NTriples, since it could be a URIRef in Turtle, but in NTriples it must be a URI (with a scheme).
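The extension-based behaviour described above can be illustrated with a small sketch. It is not serd's actual code: `Syntax`, `guess_syntax` and `allows_relative_iri` are hypothetical names introduced here only to show why the same line is accepted when the file is named oneliner.ttl and rejected when it is named oneliner.nt.

```cpp
#include <cstddef>
#include <string>

enum class Syntax { Turtle, NTriples, Unknown };

// Hypothetical helper: choose a syntax from the file extension,
// as a stand-in for what a tool might do when no syntax is given.
Syntax guess_syntax(const std::string& filename) {
    const std::size_t dot = filename.rfind('.');
    if (dot == std::string::npos) return Syntax::Unknown;
    const std::string ext = filename.substr(dot);
    if (ext == ".ttl") return Syntax::Turtle;
    if (ext == ".nt") return Syntax::NTriples;
    return Syntax::Unknown;
}

// Turtle permits relative IRI references (resolved against a base);
// N-Triples requires absolute IRIs, i.e. a scheme must be present.
bool allows_relative_iri(Syntax s) {
    return s == Syntax::Turtle;
}
```

Under this sketch, `allows_relative_iri(guess_syntax("oneliner.ttl"))` is true while the same call for "oneliner.nt" is false, which matches the discrepancy seen in the thread.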