kantan.csv
Include the line number of an error when parsing
It'd be nice for errors to come tagged with the row number at which they occur.
Keeping track of this is fairly straightforward: we need only keep a counter in `CsvReader`. Adding that information to errors is trickier, though, as errors currently don't really have customisable error messages.
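In the meantime, a caller can at least keep a record counter itself. A minimal sketch, assuming a recent kantan.csv where `asCsvReader` is available on `String` sources via `kantan.csv.ops._` and yields `ReadResult` values (an `Either` of error or row); the CSV content and row type are made up for illustration:

```scala
import kantan.csv._
import kantan.csv.ops._

object CountedRead extends App {
  val raw = "abc,123\ndef,oops\nghi,789"

  // Pair every decoded record with a manually maintained counter so that
  // failures can at least be reported with their record position.
  // Note: this is a record index, not a physical line number, since a
  // quoted record may span several physical lines.
  var index = 0
  raw.asCsvReader[(String, Int)](rfc).foreach { result =>
    result.fold(
      error => println(s"record $index failed: $error"),
      row   => println(s"record $index ok: $row")
    )
    index += 1
  }
}
```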
Along with the line number(s), it would be useful to have the actual line(s) which failed to read. This would make it easier to implement error recovery mechanisms like sending the unreadable CSV somewhere for manual intervention.
This can be achieved right now by calling `readCsvRow` one line at a time, but that won't handle CSV rows which take up multiple lines due to quoting.
Why wouldn't it handle CSV rows which take up multiple lines? Can you show what you mean by that? If it's true, it might be an entirely different, fairly critical issue.
I'll try to explain:
Let's say I'm parsing a file like this, which contains five lines but only three records (each with two fields: a `String` and an `Int`):
```
abc,123
def,456
ghi,"7
8
9"
```
Lines 0 and 1 parse okay, so I process them immediately.
The last three lines constitute one record because of the quotes, and maybe they fail because `7\n8\n9` is not a valid number. So I want to extract that record for manual processing. To do that, I need either (a) the unparseable record's line numbers (2-4), so I can extract those lines from the original source if I still have it, or (b) a copy of the record which couldn't be parsed:
ghi,"7
8
9"
I'm dealing with a case like this right now, but I've had to assume that there are never any quotes (or any multi-line records). That lets me split the file/stream into lines and parse them one at a time, converting each line to an `Either[String, MyCaseClass]` (where a `String` is a bad line, and `MyCaseClass` is a successfully parsed record). But this code could be simpler if the parser simply returned the offending line(s), and it would also work with quotes and multi-line records.
kantan.csv will absolutely allow you to do that. Provide a `CellDecoder[Int]` that ignores whitespace and this will work. It's not the default behaviour, because the default behaviour is to accept only valid ints as ints, but you absolutely are allowed to override what a valid int is for your specific CSV.
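For reference, such a whitespace-tolerant decoder might look something like the following; a sketch based on kantan.csv's documented `CellDecoder.from` and exception-catching `DecodeResult.apply` helpers (names may differ slightly across versions):

```scala
import kantan.csv._

object LenientDecoders {
  // An Int decoder that strips all whitespace (including the newlines of a
  // quoted multi-line cell) before parsing, overriding the default strict
  // behaviour for this particular CSV.
  implicit val lenientInt: CellDecoder[Int] =
    CellDecoder.from(s => DecodeResult(s.filterNot(_.isWhitespace).toInt))
}
```

With something like that in scope, the `7\n8\n9` cell from the example above would decode as `789` rather than failing the record.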
@nrinaudo That's just a made-up, simplified example. It could equally have been a multi-line alphabetic string in a field where an `Int` was expected.
My real-world case involves a literal string value (sort of a record type) which is being converted to an enum (well, a case object) by a `CellDecoder`, and the question of what to do when an unexpected string shows up. But in general, I'm thinking of any sort of unexpected cell value.
Understood. I still think using the error itself as a recovery mechanism is a bad idea, though. It'd be far better to provide the error recovery in the decoders themselves, so that you can turn your illegal values into legal ones, or into, say, an `Either[String, A]`, where the `String` is the bad cell and the `A` is the decoded good cell.
This is how I meant kantan.csv to deal with the scenarios you describe, because I think that's the best possible way of dealing with them, and unless you can convince me that what you have in mind is better, I don't think I'll bother implementing inferior alternatives to what is already supported.
Feel free to take a stab at convincing me though, just in a different issue, because I feel this is only tenuously related to error messages including line numbers.
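To make the decoder-level recovery described above concrete, a sketch might look like this; the fallback-to-raw-cell behaviour is my own illustration rather than anything built in, and it assumes the `CellDecoder.from` / `DecodeResult` API mentioned earlier:

```scala
import kantan.csv._

object RecoveringDecoders {
  // Wrap an existing cell decoder so that a cell which fails to decode is
  // kept as Left(rawCell) instead of failing the whole row.
  def orRaw[A](implicit underlying: CellDecoder[A]): CellDecoder[Either[String, A]] =
    CellDecoder.from { s =>
      val recovered: Either[String, A] =
        underlying.decode(s).fold(_ => Left(s), a => Right(a))
      DecodeResult.success(recovered)
    }
}
```

Even with such a decoder, though, only the single bad cell is recovered, not the whole offending record.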
@nrinaudo That's a good point about having the decoder return `Either[String, A]`, but for this sort of error recovery I need the entire line (or multi-line record) that failed, not just the field which couldn't be parsed into the target type. Moving discussion to #188.