
Support multi-byte terminator

Open joar opened this issue 6 years ago • 9 comments

I'm parsing some CSV files that use the "record delimiter"/terminator \x02\n. I can't use \n as the terminator, since \n may occur in record field values.

Another bug is that I don't get any records at all if I choose \x02 as the terminator. My suspicion is that this library always does something special for \n even if it's not set as the terminator.

joar • Jun 04 '18 13:06

I'm parsing some CSV files that use the "record delimiter"/terminator \x02\n. I can't use \n as the terminator, since \n may occur in record field values.

If \n may occur in field values, then those values should be quoted. You don't need a multi-byte delimiter for this.

My suspicion is that this library always does something special for \n even if it's not set as the terminator.

This is not the intent. Please file a bug. Please include a full program and the input, along with the expected output and the actual output.

BurntSushi • Jun 04 '18 13:06

Here's an example: https://gist.github.com/joar/4335726e8172535f69eef131ec6135af.

joar • Jun 05 '18 07:06

Good catch, there is indeed a bug in the parser. Namely, this is the transition for comments in CSV data:

            InComment => {
                if b'\n' == c {
                    (StartRecord, NfaInputAction::Discard)
                } else {
                    (InComment, NfaInputAction::Discard)
                }
            }

In other words, the \n terminator is hard-coded. This needs to be tweaked a bit to use the configured terminator, which is \x02 in your case.
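
A minimal sketch of what that tweak might look like, using stand-in types rather than the crate's real internals. `Terminator`, `State`, `Action` and `in_comment_transition` are hypothetical names defined only for this example, and the CRLF handling is an assumption:

```rust
/// Hypothetical stand-in for the configured record terminator.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Terminator {
    /// Treat `\r` or `\n` as ending a record (assumed default behaviour).
    Crlf,
    /// Treat exactly this one byte as ending a record.
    Any(u8),
}

impl Terminator {
    /// Returns true if `c` ends a record under this configuration.
    fn matches(self, c: u8) -> bool {
        match self {
            Terminator::Crlf => c == b'\r' || c == b'\n',
            Terminator::Any(b) => c == b,
        }
    }
}

/// Only the two parser states relevant to comments are modelled here.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    StartRecord,
    InComment,
}

/// Comment bytes are always thrown away, so one action is enough.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Action {
    Discard,
}

/// Comment transition that consults the configured terminator
/// instead of comparing against a hard-coded `b'\n'`.
fn in_comment_transition(term: Terminator, c: u8) -> (State, Action) {
    if term.matches(c) {
        (State::StartRecord, Action::Discard)
    } else {
        (State::InComment, Action::Discard)
    }
}
```

With `Terminator::Any(0x02)`, a `\n` inside a comment keeps the parser in `InComment`, and only `\x02` moves it back to `StartRecord`.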

If I remove the comments from your file, then I get:

Reading using terminator: 0x2
StringRecord(["1527066000035", "143508", "DOM", "Dominican Republic"])
StringRecord(["\n1527066000035", "143509", "ECU", "Ecuador"])
StringRecord(["\n1527066000035", "143510", "HND", "Honduras"])
StringRecord(["\n1527066000035", "143511", "JAM", "Jamaica"])
StringRecord(["\n1527066000035", "143512", "NIC", "Nicaragua"])
StringRecord(["\n1527066000035", "143513", "PRY", "Paraguay"])
StringRecord(["\n"])

This is what I'd expect. Your data is formatted in a weird way. In particular, you're seemingly using \x02\n as your record terminator, but you should just be using \x02. And when you go down that path, all of a sudden comments aren't going to be particularly helpful unless you have an editor specifically designed to handle \x01 and \x02 as field and record separators.

This is why people just stick to the standard delimiters. It's much easier.

BurntSushi • Jun 05 '18 11:06

I agree regarding the delimiter. Unfortunately, my source for the CSV is Apple's Enterprise Partner Feed. I guess I could run the file through sed 's/\x02\n/\x02/g' before parsing.
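
The same preprocessing could also be done in Rust before handing the bytes to the reader. A rough sketch, assuming the whole file fits in memory; `strip_newline_after_terminator` is a made-up helper, not part of the csv crate:

```rust
/// Collapse every `\x02\n` pair into a single `\x02`, so that `\x02`
/// alone can serve as the record terminator. A `\n` that does not
/// directly follow `\x02` is left untouched, since it may be field data.
fn strip_newline_after_terminator(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        out.push(input[i]);
        // If this byte is the terminator and a newline follows, skip the newline.
        if input[i] == 0x02 && input.get(i + 1) == Some(&b'\n') {
            i += 1;
        }
        i += 1;
    }
    out
}
```

The cleaned bytes could then be parsed with the terminator set to \x02, for example via `csv::ReaderBuilder` with `.terminator(csv::Terminator::Any(b'\x02'))` and, assuming \x01 is the field separator, `.delimiter(b'\x01')`.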

joar • Jun 12 '18 07:06

I ran into a similar problem. A client sent me a CSV file where the field separator is |, the line terminator is \r\n, and there is no escape character; fields containing \r\n are wrapped in double quotes ("). However, fields can contain an unquoted \n, since \n on its own is not a terminator.

fabien-anabasis • Feb 10 '22 13:02

@givors-anabasis This crate can't handle that data then. You'll need to find some other way.
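
One such way is a small hand-written splitter. The sketch below is only an illustration under assumptions taken from the description above: `|` separates fields, `\r\n` ends a record, a `"` simply toggles quoting (there is no escape character, so a field cannot contain a literal quote), and a bare `\n` is always data. `split_records` is a hypothetical helper, not something this crate provides:

```rust
/// Split `input` into records of fields for the format described above.
/// Quote characters are dropped from the output; `|` and `\r\n` inside
/// quotes are kept as field data, and a bare `\n` is always field data.
fn split_records(input: &str) -> Vec<Vec<String>> {
    let mut records = Vec::new();
    let mut record = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // No escape character exists, so every quote toggles quoting.
            '"' => in_quotes = !in_quotes,
            // An unquoted `|` ends the current field.
            '|' if !in_quotes => record.push(std::mem::take(&mut field)),
            // An unquoted `\r\n` ends the current field and the record.
            '\r' if !in_quotes && chars.peek() == Some(&'\n') => {
                chars.next(); // consume the `\n`
                record.push(std::mem::take(&mut field));
                records.push(std::mem::take(&mut record));
            }
            // Everything else, including a bare `\n`, is field data.
            _ => field.push(c),
        }
    }

    // Flush a final record that was not followed by `\r\n`.
    if !field.is_empty() || !record.is_empty() {
        record.push(field);
        records.push(record);
    }
    records
}
```

For the input `a|b\nc|"x\r\ny"\r\nd|e\r\n` this yields two records: the first with the fields `a`, `b\nc` and `x\r\ny`, the second with `d` and `e`.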

BurntSushi • Feb 10 '22 13:02

All right, too bad, thanks for the reply @BurntSushi

fabien-anabasis • Feb 11 '22 08:02