Failure resilience
Hello,
Is there any way to use `mlr` such that malformed lines are silently dropped instead of halting the command for the entire file?
Thanks @dandrei and @masgo!
Are you concerned about a particular format (CSV, JSON) or all of them? (This is a great ask regardless, just thinking through it in advance)
I never encountered broken JSON. If there was a problem with JSON, it was always encoding-related and could be solved by running it through `iconv` before running it through Miller.
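For concreteness, a minimal sketch of that pipeline (the source encoding `latin1` is just an example):

```sh
# Re-encode first, then let Miller parse the now-valid UTF-8 JSON.
iconv -f latin1 -t utf-8 input.json | mlr --json cat
```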
But CSVs are almost always somewhat broken. Using `mlr cut` has become a common step for me in order to get well-formed and consistent CSVs.
Having leading/trailing lines with some kind of output status/summary from the previous program is quite common. This can be solved by using `head` and `tail`, but having Miller drop them automatically would be nice. In situations where there are multiple small CSVs, one could then simply `cat` them all together and continue with Miller, instead of looping through all of them and removing the broken lines beforehand.
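A sketch of that manual cleanup, assuming one status line at the top and one summary line at the bottom (line counts are illustrative; `head -n -1` is GNU `head`):

```sh
# Strip the first and last lines, then hand clean CSV to Miller.
tail -n +2 report.csv | head -n -1 | mlr --csv cut -f a,b,c
```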
Back to the main topic: while dropping silently is nice for many applications, raw output of the broken lines (or just their line numbers) to a secondary error file (or to stderr) would be even better.
Thank you for the replies. I agree with what @masgo wrote.
The use cases that prompted me to start this topic involve working with large files that contain comma-separated data, where a handful of lines (from among tens or hundreds of millions) are broken in hard-to-debug and quirky ways (weirdly escaped quotes that result in unterminated cells, mismatched column count, mixed encodings, etc.).
Some assumed, opt-in information loss (resulting in good-enough, that's-all-we-needed-to-know results) would be preferred to having to manually sanitize these files beforehand.
@masgo @dandrei

> preferred to having to manually sanitize these files beforehand

This is spot-on.

One of Miller's reasons-for-being is data-cleaning, so it shouldn't be necessary to pre-clean for the cleaner :^/
At the time I implemented the RFC-compliant CSV parser some years ago, I took a rather hard-line approach to file formatting -- operating on some (well-intended) feedback that format-compliant tooling incentivizes people to produce format-compliant data, and that easing off on the former encourages slippery-slope behavior for the latter. That feedback was well-informed for the person who gave it to me and their data sources (more human-produced). However ... lots of other people's data (like yours, and sometimes mine, although at present my formatting challenges are around MongoDB log-lines, which are a totally different issue :) ) has machine-generated noise, and playing the data-purity card isn't helpful in that context.
For CSV, at a technical level, it's a matter of:
- UX level:
- The opt-in flag
- A place to send malformed data (presumably stderr)
- Implementation internals:
- Detecting the parse error
- Finding the end of line
- Continuing from there
Parse errors come in two forms: one is a header/data length mismatch, which is easy to send to stderr without abending; the other involves imbalanced double quotes, which may be a little trickier to make reliable, but I will check it out.
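As a sketch, the UX being described might look like the following, with good records on stdout and malformed lines diverted to stderr where they can be captured (the `-k` flag spelling anticipates the trial branch discussed below):

```sh
# Keep going past malformed lines; collect them for later inspection.
mlr -k --csv cut -f a,b huge.csv 2> rejects.txt
```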
Thank you!! :D
@dandrei @masgo more info from working on https://github.com/johnkerl/miller/tree/mlr-k:
- Note I'm trialing this in the Go port but will add this feature to the C implementation as well.
- An easy win is file-not-found in `mlr --whatever a.csv b.csv c.csv` -- if `b.csv` is missing, `mlr -k` will continue; this is good (sketched below).
- For CSV, `mlr -k` will continue on header/data length mismatch errors; this is good.
- For double-quote imbalances where there's an odd number of double quotes, the CSV parser will keep looking for the closing double-quote on subsequent lines ... in fact it is its job to do so :( -- this is not so good and I don't see a reasonable way around it.
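A minimal sketch of the file-not-found case, assuming the trial-branch behavior described above:

```sh
# Without -k, a missing b.csv aborts the whole run; with -k, the error
# is reported and a.csv and c.csv are still processed.
mlr -k --csv cat a.csv b.csv c.csv
```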
> For CSV, `mlr -k` will continue on header/data length mismatch errors; this is good.

Isn't this the same as `--allow-ragged-csv-input`?

> For double-quote imbalances where there's an odd number of double quotes, the CSV parser will keep looking for the closing double-quote on subsequent lines ... in fact it is its job to do so :( -- this is not so good and I don't see a reasonable way around it.
This is truly a difficult topic. For most of my cases it boils down to two questions: "Do I expect multi-line strings in any field?" and "Could the CSV be ragged?" Most of the time the answer to both is "no". Then, when Miller (or Excel) encounters one of these things, I can deduce that there is some quotation error in the data.
The ragged part is sometimes a bit difficult, because some programs produce CSVs in the style of:
```
a,b,c,d
1,2,3,4,
```
i.e., the data lines have an (unnecessary) separator character at the end, while the header does not.
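If I understand Miller's existing `--allow-ragged-csv-input` correctly, it already accepts this style, keying the extra trailing field positionally instead of aborting:

```sh
# Accept rows whose field count differs from the header's.
mlr --csv --allow-ragged-csv-input cat data.csv
```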
Sometimes it also helps to check that the columns after the last string column contain valid values (int, float, etc.): if any of them are empty or zero, there is probably a quotation error somewhere before that point.
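A sketch of that sanity check, assuming a column `d` that should always be an integer, using Miller's `is_empty` and `is_int` DSL functions:

```sh
# Print only the suspect rows: d empty or not parseable as an integer.
mlr --icsv --opprint filter 'is_empty($d) || !is_int($d)' data.csv
```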
So maybe it would be useful if Miller could (optionally) output the oddities it encounters, like multi-line strings, ragged data, etc.
Thanks @masgo!
> Isn't this the same as `--allow-ragged-csv-input`?
Not quite. With `--allow-ragged-csv-input` the rows are accepted as data; with `-k` they'll be printed to stderr.
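Side by side, assuming the trial-branch behavior described above:

```sh
# Ragged rows kept and processed as data:
mlr --csv --allow-ragged-csv-input cat data.csv

# Ragged rows rejected: good rows to stdout, bad lines to stderr,
# which can be captured in a separate file:
mlr -k --csv cat data.csv 2> bad-lines.txt
```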
> So maybe it would be useful if Miller could (optionally) output the oddities it encounters, like multi-line strings, ragged data, etc.
Worth thinking about! :)
Could multi-line strings be opted out of with a flag? E.g., when the user specifies something like `--disallow-multiline-strings`, treat lines with unclosed quotes as broken.
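If such a flag existed, usage might look like this (purely hypothetical: `--disallow-multiline-strings` is the proposal above, not an existing Miller option):

```sh
# Hypothetical: treat an unclosed quote as ending at the line break, so
# the broken line can be rejected instead of swallowing the rest of the
# file while hunting for the closing quote.
mlr --disallow-multiline-strings -k --csv cat data.csv 2> broken.txt
```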