kantan.csv icon indicating copy to clipboard operation
kantan.csv copied to clipboard

Include the line number of an error when parsing

Open nrinaudo opened this issue 5 years ago • 7 comments

It'd be nice for errors to come tagged with the row number at which they arrive.

Keeping track of this is fairly straightforward, we need only keep a counter in CsvReader, but adding that information to errors is trickier, as errors currently don't really have customisable error messages.

nrinaudo avatar May 25 '19 19:05 nrinaudo

Along with the line number(s), it would be useful to have the actual line(s) which failed to read. This would make it easier to implement error recovery mechanisms like sending the unreadable CSV somewhere for manual intervention.

This can be achieved right now by calling readCsvRow one line at a time, but that won't handle CSV rows which take up multiple lines due to quoting.

seanf avatar Jun 04 '19 00:06 seanf

Why wouldn't it handle CSV rows which take up multiple lines? Can you show what you mean by that? If it's true, it might be an entirely different, fairly critical issue

nrinaudo avatar Jun 04 '19 15:06 nrinaudo

I'll try to explain:

Let's say I'm parsing a file like this, which contains five lines but only three records (with two fields: a String and an Int):

abc,123
def,456
ghi,"7
8
9"

Lines 0 and 1 parse okay, so I process them immediately.

The last three lines constitute one record because of the quotes, and maybe they fail because 7\n8\n9 is not a valid number. So I want to extract that record for manual processing. To do that, I need (a) the unparseable record's line numbers (2 - 4) (so I can extract those lines from the original source, if I still have it), or (b) a copy of the record which couldn't be parsed:

ghi,"7
8
9"

I'm dealing with a case like this right now, but I've had to make the assumption that there are never any quotes (or any multi-line records). Thus I can split the file/stream into lines and parse the lines one at a time, converting each line to Either[String, MyCaseClass] (where a String is a bad line, and MyCaseClass is a successfully parsed record). But this code could be simpler if the parser simply returned the offending line(s), plus it would work with quotes and multi-line records.

seanf avatar Jun 25 '19 10:06 seanf

Kantan.csv will absolutely allow you to do that. Provide a CellDecoder[Int] that ignores whitespace and this will work. It’s not the default behaviour, because the default behaviour is to only accept valid ints as ints, but you absolutely are allowed to override what a valid int is for your specific csv.

nrinaudo avatar Jun 25 '19 11:06 nrinaudo

@nrinaudo That's just a made-up, simplified example. It could equally have been a multiline alpha string in a field where an Int was expected.

My real world case involves a literal string value (sort of a record type) which is being converted to an enum (well, a case object) by a CellDecoder, and what to do if an unexpected string shows up. But in general, I'm thinking of any sort of unexpected cell value.

seanf avatar Jun 25 '19 11:06 seanf

understood. I still think using the error itself as a recovery mechanism is a bad idea though. It’d be far better to provide the error recovery in the decoders themselves, so that you can turn your illegal values into legal ones, or into, say, an Either[String, A] where he string is the bad cell and the A the decoded good cell.

This is how I meant kantan.csv to deal with the scenarios you describe, because I think that’s the best possible way of dealing with them, and unless you can convince me what you have in mind is better, I don’t think I’ll bother implementing inferior alternatives to what is already supported.

Feel free to take a stab at convincing me though, just in a different issue, because I feel this is only tenuously related to error messages including line numbers.

nrinaudo avatar Jun 25 '19 12:06 nrinaudo

@nrinaudo That's a good point about having the decoder return Either[String, A], but for this sort of error recovery I need the entire line (or multi-line record) that failed, not just the field which couldn't be parsed into the target type. Moving discussion to #188.

seanf avatar Jun 25 '19 23:06 seanf