hdt-cpp
Add option to ignore error instead of throwing error
While working with dirty data, I realised that being able to skip bad rows when parsing RDF is very useful. This feature was suggested in issue #117 but met with strong opposition. I would like to bring it up one more time, in the hope that opinions have changed since.
The program should give the option to warn-instead-of-error for these reasons:
- I know that the errors are minor and am willing to drop the faulty triples.
- I want to go all the way through first, get the list of all lines with errors, and bulk-edit my huge RDF file (579GB) instead of fixing them one by one. When the faulty triples are at the end of the file, it's just painful and takes a lot of dev time.
I think this is something for the SERD parser, rather than HDT, no?
From the user's point of view, I don't see an option for it. Can you give me a hint?
serd already has a lax parsing mode for roughly this purpose, although (as you might expect) things can go horribly wrong with syntactically invalid Turtle or TriG documents and drop a ton of data on the floor. It works fine for line-based formats like NTriples and NQuads though.
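To illustrate why line-based formats recover well, here is a minimal Python sketch of "report and keep going" over N-Triples-like input. It is not serd's parser: the `TRIPLE` pattern is a deliberately simplified stand-in for the real grammar, and `lax_filter` is a hypothetical helper name. The point is only that each statement sits on its own line, so a bad line costs exactly one triple, whereas in Turtle or TriG a statement can span many lines and the place to resume is ambiguous.

```python
import re

# Simplified building blocks (assumption: NOT the full N-Triples grammar).
_IRI = r"<[^<>\s]*>"
_BNODE = r"_:\w+"
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^<>\s]*>)?'

# One statement per line: subject, predicate, object, terminating dot.
TRIPLE = re.compile(
    rf"^(?:{_IRI}|{_BNODE})\s+{_IRI}\s+(?:{_IRI}|{_BNODE}|{_LITERAL})\s*\.$"
)


def lax_filter(lines):
    """Return (kept, errors): well-formed lines, and (line_number, text) for the rest."""
    kept, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are neither triples nor errors
        if TRIPLE.match(stripped):
            kept.append(stripped)
        else:
            # Record the problem and resume at the next line instead of aborting.
            errors.append((lineno, stripped))
    return kept, errors
```

Running it over a whole file yields both the clean triples and the complete list of bad line numbers in a single pass.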
Let's consider my second point. I am willing to fix the errors, and I want the full list of errors to fix in one go rather than going through a launch-fix-launch cycle.
@mhoangvslev You could use serdi on the command line to strip the bad triples out yourself before loading it. It uses the same parser, so should encounter the same errors as hdt-cpp but be much quicker to use as a tool for this. With lax parsing (-l) it should print all the errors encountered in one run.
I usually do this from a text editor with a compilation mode that understands GCC warning syntax (vim, emacs, etc.) so you can jump immediately to each error.
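For a file too large to step through in an editor, the same error output can be processed in bulk. The following is a rough Python sketch, assuming the errors were saved to a log and that each one contains a GCC-style `FILE:LINE:COL: message` location, optionally prefixed with `error: `. That format is an assumption to check against your actual output, and `error_lines` and `drop_lines` are hypothetical helper names. Dropping whole lines only makes sense for line-based formats such as NTriples or NQuads.

```python
import re

# Assumed location format: "[error: ]FILE:LINE:COL: message".
LOCATION = re.compile(r"^(?:error: )?(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): ")


def error_lines(log_text):
    """Return the sorted, de-duplicated line numbers mentioned in an error log."""
    numbers = set()
    for entry in log_text.splitlines():
        match = LOCATION.match(entry)
        if match:
            numbers.add(int(match.group("line")))
    return sorted(numbers)


def drop_lines(lines, bad_numbers):
    """Yield every line whose 1-based number is not in bad_numbers.

    Written as a generator so a very large file can be streamed
    instead of being loaded into memory.
    """
    bad = set(bad_numbers)
    for lineno, line in enumerate(lines, start=1):
        if lineno not in bad:
            yield line
```

In practice `lines` would be an open file object and the output would be written to a new file, leaving the original untouched so the reported lines can still be inspected and fixed.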