hdt-cpp

Add option to ignore error instead of throwing error

Open mhoangvslev opened this issue 2 years ago • 5 comments

While working with dirty data, I realised that being able to skip bad rows when parsing RDF is very useful. This feature was suggested in issue #117 but met with strong opposition. I would like to bring it up one more time, in the hope that opinions have changed since.

The program should give the option to warn-instead-of-error for these reasons:

  1. I know that the errors are minor and am willing to drop those faulty triples.
  2. I want to go all the way through first, get the list of all lines with errors, and bulk-edit my huge RDF file (579 GB) instead of fixing them one by one. When the faulty triples are at the end of the file, it's just painful and takes a lot of dev-time.

mhoangvslev avatar Mar 23 '22 00:03 mhoangvslev

I think this is something for the SERD parser, rather than HDT, no?

mielvds avatar Mar 23 '22 11:03 mielvds

From the user's pov, I don't see an option for it. Can you give me a hint?

mhoangvslev avatar Mar 23 '22 11:03 mhoangvslev

serd already has a lax parsing mode for roughly this purpose, although (as you might expect) things can go horribly wrong with syntactically invalid Turtle or TriG documents and drop a ton of data on the floor. It works fine for line-based formats like NTriples and NQuads though.
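
To illustrate why line-based formats recover so well, here is a minimal Python sketch. It is not serd's implementation: the `TERM` pattern and the `lax_filter` helper are hypothetical and cover only a rough subset of N-Triples. Because each statement sits on its own line, a bad line can be recorded and skipped without affecting its neighbours.

```python
import re

# Rough, simplified pattern for one RDF term: IRI, blank node, or literal.
# ASSUMPTION: this is NOT the full N-Triples grammar, just enough for a demo.
TERM = r'(?:<[^<>\s]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^<>\s]*>)?)'
TRIPLE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')

def lax_filter(lines):
    """Return (good_statements, bad_line_numbers) without stopping on errors."""
    good, errors = [], []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are not statements
        if TRIPLE.match(line):
            good.append(stripped)
        else:
            errors.append(number)  # remember the line, then carry on
    return good, errors
```

In Turtle or TriG a statement can span many lines and depend on earlier prefixes, so there is no equally safe point to resume from, which matches the caveat above.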

drobilla avatar Mar 31 '22 17:03 drobilla

Let's consider my second point. I am willing to fix the errors, and I want the full list of errors to fix in one go instead of going through launch-fix-launch.

mhoangvslev avatar Mar 31 '22 17:03 mhoangvslev

@mhoangvslev You could use serdi on the command line to strip the bad triples out yourself before loading it. It uses the same parser, so should encounter the same errors as hdt-cpp but be much quicker to use as a tool for this. With lax parsing (-l) it should print all the errors encountered in one run.

I usually do this from a text editor with a compilation mode that understands GCC warning syntax (vim, emacs, etc.) so you can jump immediately to each error.
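
For anyone who prefers a script to an editor, here is a rough Python sketch that collects the line numbers from a saved error log of a lax (`-l`) run. The `collect_errors` helper is hypothetical, and the `error: FILE:LINE:COL: MESSAGE` layout is an assumption about what the diagnostics look like; check it against your own output and adjust the pattern if it differs.

```python
import re
from collections import namedtuple

ParseError = namedtuple("ParseError", "file line column message")

# ASSUMPTION: each diagnostic looks like "error: FILE:LINE:COL: MESSAGE".
ERROR_RE = re.compile(
    r"^error: (?P<file>.+?):(?P<line>\d+):(?P<col>\d+): (?P<msg>.*)$"
)

def collect_errors(log_lines):
    """Extract (file, line, column, message) records from diagnostic lines."""
    errors = []
    for raw in log_lines:
        match = ERROR_RE.match(raw.rstrip("\n"))
        if match:  # lines in any other shape are ignored
            errors.append(
                ParseError(
                    match["file"],
                    int(match["line"]),
                    int(match["col"]),
                    match["msg"],
                )
            )
    return errors
```

Passing an open log file to `collect_errors` reads it line by line, and the resulting line numbers can then drive a bulk edit of the source file.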

drobilla avatar Mar 31 '22 20:03 drobilla