fastcsv
Allow null / zero bytes in CSVs
When an input CSV contains the 0 / null byte, FastCSV emits two rows for that row, both "incorrect":
null_csv = "one,two\nthr\00ee,four"
FastCSV.raw_parse(null_csv) { |row| p row }
Outputs:
["one", "two"]
["thr"]
["thr", "thr\u0000ee", "four"]
Quoting the cell causes an error instead:
quoted_null_csv = %Q["one","two"\n"thr\00ee","four"]
FastCSV.raw_parse(quoted_null_csv) { |row| p row }
Outputs:
["one", "two"]
Traceback (most recent call last):
1: from (irb):6
FastCSV::MalformedCSVError (Unclosed quoted field on line 2.)
It's my understanding that the null byte is legal in UTF-8 strings. Ruby and other (non-C) languages can certainly handle it cleanly, so I'd love it if FastCSV could handle it as well. I can try to put together a patch, but I have zero Ragel knowledge, so I'm not sure how easy or difficult it will be.
I can confirm that the Ruby CSV library doesn't have this issue:
require 'csv'
CSV.parse("one,two\nthr\00ee,four")
produces:
[["one", "two"], ["thr\u0000ee", "four"]]
That said, I haven't kept this library up-to-date with recent Ruby versions (I haven't tested to find out).
Unless you really need those NUL characters, it might be simplest to wrap the input stream to remove the NUL characters: for example, by overriding the read method of whatever superclass you're using.
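A minimal sketch of that wrapper idea, for anyone landing here. The class name NulStrippingIO is made up for illustration, and it assumes the parser pulls its data through `read` on the object it is given, which I have not verified against FastCSV's internals:

```ruby
require "delegate"
require "stringio"

# Hypothetical wrapper (not part of FastCSV): forwards everything to the
# wrapped IO, but strips NUL characters from whatever `read` returns.
class NulStrippingIO < SimpleDelegator
  def read(*args)
    chunk = __getobj__.read(*args)
    # `read` returns nil at EOF when a length is given, so guard with &.
    # Note: a caller-supplied output buffer (second argument) is left unstripped.
    chunk&.delete("\u0000")
  end
end

io = NulStrippingIO.new(StringIO.new("one,two\nthr\u0000ee,four"))
cleaned = io.read
```

Here `cleaned` no longer contains the NUL, so the second row would parse as "three", "four". If the whole input is already in memory as a string, calling `delete("\u0000")` on it directly is simpler than wrapping anything.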
If you do need the NUL characters, I guess you can do the same as above, except replace them with a sentinel value, then do post-processing to return them to NUL characters.
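A rough sketch of the sentinel approach, under a few assumptions: the input is a UTF-8 string, the private-use code point U+E000 never appears in the real data, and the parser yields each row as an array of strings (or nils) to a block, the way `FastCSV.raw_parse` does in the examples above. The names NUL_SENTINEL, hide_nuls, restore_nuls and parse_preserving_nuls are invented for this example:

```ruby
# Hypothetical sentinel: a private-use code point assumed absent from the data.
NUL_SENTINEL = "\u{E000}"

# Swap every NUL for the sentinel before the parser sees the text.
def hide_nuls(text)
  text.gsub("\u0000", NUL_SENTINEL)
end

# Swap the sentinel back to NUL in every cell of a parsed row.
# Cells may be nil (empty fields), hence the safe navigation.
def restore_nuls(row)
  row.map { |cell| cell&.gsub(NUL_SENTINEL, "\u0000") }
end

# `parser` is anything that responds to call(text) and yields rows to a block,
# for example FastCSV.method(:raw_parse).
def parse_preserving_nuls(text, parser)
  if text.include?(NUL_SENTINEL)
    raise ArgumentError, "input already contains the sentinel character"
  end

  rows = []
  parser.call(hide_nuls(text)) { |row| rows << restore_nuls(row) }
  rows
end
```

With FastCSV loaded, usage would look something like `parse_preserving_nuls(null_csv, FastCSV.method(:raw_parse))`. The up-front check matters: if the sentinel could already occur in the data, the round trip would silently turn those characters into NULs, so pick a different sentinel in that case.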
I agree that this is a bug, but I'm unlikely to fix it.
Thank you for the quick reply! In our case we'd like to preserve the NUL values - the idea of a sentinel is a good one. Appreciate the work you've done on this library - it has been very useful to my team.