
Allow null / zero bytes in CSVs

Open bradleybuda opened this issue 2 years ago • 2 comments

When an input CSV contains a null byte (0x00), FastCSV emits two rows for the affected row, both incorrect:

null_csv = "one,two\nthr\00ee,four"
FastCSV.raw_parse(null_csv) { |row| p row }

Outputs:

["one", "two"]
["thr"]
["thr", "thr\u0000ee", "four"]

Quoting the cell causes an error instead:

quoted_null_csv = %Q["one","two"\n"thr\00ee","four"]
FastCSV.raw_parse(quoted_null_csv) { |row| p row }

Outputs:

["one", "two"]
Traceback (most recent call last):
        1: from (irb):6
FastCSV::MalformedCSVError (Unclosed quoted field on line 2.)

It's my understanding that the null byte is legal in UTF-8 strings. Ruby and other (non-C) languages can certainly handle it cleanly, so I'd love it if FastCSV could handle it as well. I can try to put together a patch, but I have zero Ragel knowledge, so I'm not sure how easy or difficult it will be.

bradleybuda, Jan 25 '22 22:01

I can confirm that the Ruby CSV library doesn't have this issue:

require 'csv'

CSV.parse("one,two\nthr\00ee,four")

produces:

[["one", "two"], ["thr\u0000ee", "four"]]

That said, I haven't kept this library up to date with recent Ruby versions, and I haven't tested whether that matters here.

Unless you really need those NUL characters, it might be simplest to wrap the input stream to remove the NUL characters: for example, by overriding the read method of whatever superclass you're using.
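A minimal sketch of that wrapper idea, under some assumptions: `NulStrippingIO` is a hypothetical name, it wraps an IO-like object rather than subclassing one, and it only helps if the parser actually consumes its input through `read` (which I haven't checked for FastCSV). It's demonstrated on a `StringIO` here.

```ruby
require 'stringio'

# Hypothetical wrapper (not part of FastCSV): delegates #read to any
# IO-like object and drops NUL bytes from whatever comes back.
class NulStrippingIO
  def initialize(io)
    @io = io
  end

  # Same contract as IO#read: nil at EOF when a length is given.
  def read(length = nil)
    chunk = @io.read(length)
    chunk&.delete("\0")
  end
end

io = NulStrippingIO.new(StringIO.new("one,two\nthr\0ee,four"))
cleaned = io.read
p cleaned
```

Note that when a length is passed, the returned chunk can be shorter than requested, since the NUL bytes are removed after reading.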

If you do need the NUL characters, I guess you can do the same as above, except replace them with a sentinel value, then do post-processing to return them to NUL characters.
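The sentinel round trip could look roughly like this. Everything named here is an assumption for illustration: `parse_preserving_nul` and `SENTINEL` are hypothetical, the sentinel is a Unicode private-use character assumed not to appear in the real data, and the parser is passed in as a block. The demo uses Ruby's standard `CSV.parse` as a stand-in; a FastCSV call would have to collect its yielded rows into an array first.

```ruby
require 'csv'

# Hypothetical sentinel: a private-use code point assumed absent from the data.
SENTINEL = "\uE000"

# Masks NUL bytes, hands the masked text to the given block (which must
# return an array of rows), then restores the NUL bytes in every cell.
def parse_preserving_nul(csv_text)
  raise ArgumentError, "sentinel already present in input" if csv_text.include?(SENTINEL)

  masked = csv_text.gsub("\0", SENTINEL)
  rows = yield masked
  # Cells can be nil for empty unquoted fields, hence the safe navigation.
  rows.map { |row| row.map { |cell| cell&.gsub(SENTINEL, "\0") } }
end

rows = parse_preserving_nul("one,two\nthr\0ee,four") { |text| CSV.parse(text) }
p rows
```

The up-front check is there because a sentinel that already occurs in the input would silently be turned into a NUL during post-processing.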

I agree that this is a bug, but I'm unlikely to fix it.

jpmckinney, Jan 31 '22 18:01

Thank you for the quick reply! In our case we'd like to preserve the NUL values - the idea of a sentinel is a good one. Appreciate the work you've done on this library - it has been very useful to my team.

bradleybuda, Jan 31 '22 18:01