rust-csv
Support multi-byte terminator
I'm parsing some CSV files that use the "record delimiter"/terminator \x02\n. I can't use \n as the terminator, since \n may occur in record field values.
Another bug is that I don't get any records at all if I choose \x02 as the terminator. My suspicion is that this library always does something special for \n even if it's not set as the terminator.
I'm parsing some CSV files that use the "record delimiter"/terminator \x02\n. I can't use \n as the terminator, since \n may occur in record field values.
If \n may occur in field values, then those values should be quoted. You don't need a multi-byte delimiter for this.
My suspicion is that this library always does something special for \n even if it's not set as the terminator.
This is not the intent. Please file a bug. Please include a full program and the input, along with the expected output and the actual output.
Here's an example: https://gist.github.com/joar/4335726e8172535f69eef131ec6135af.
Good catch, there is indeed a bug in the parser. Namely, this is the transition for comments in CSV data:
InComment => {
    if b'\n' == c {
        (StartRecord, NfaInputAction::Discard)
    } else {
        (InComment, NfaInputAction::Discard)
    }
}
In other words, the \n terminator is hard-coded. This needs to be tweaked a bit to use the configured terminator, which is \x02 in your case.
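To illustrate the idea, here is a minimal, self-contained sketch of a comment transition that consults a configured terminator instead of a hard-coded \n. It is a toy model, not the crate's actual code or its eventual fix: `ToyState`, `ToyTerminator`, and `comment_transition` are hypothetical names invented for this example.

```rust
/// Toy parser states (hypothetical; the real parser has many more).
#[derive(Clone, Copy, Debug, PartialEq)]
enum ToyState {
    StartRecord,
    InComment,
}

/// Toy model of a configurable record terminator (hypothetical type).
#[derive(Clone, Copy)]
enum ToyTerminator {
    /// Treat either `\r` or `\n` as ending a line.
    CrLf,
    /// Treat exactly one configured byte as ending a line.
    Any(u8),
}

impl ToyTerminator {
    /// Returns true if `c` ends a record under this configuration.
    fn matches(self, c: u8) -> bool {
        match self {
            ToyTerminator::CrLf => c == b'\r' || c == b'\n',
            ToyTerminator::Any(b) => c == b,
        }
    }
}

/// Next state while inside a comment: the comment ends only when the
/// configured terminator is seen, rather than a hard-coded `\n`.
fn comment_transition(term: ToyTerminator, c: u8) -> ToyState {
    if term.matches(c) {
        ToyState::StartRecord
    } else {
        ToyState::InComment
    }
}
```

With `ToyTerminator::Any(0x02)`, a \n inside a comment keeps the parser in `InComment`, and only \x02 moves it back to `StartRecord`.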
If I remove the comments from your file, then I get:
Reading using terminator: 0x2
StringRecord(["1527066000035", "143508", "DOM", "Dominican Republic"])
StringRecord(["\n1527066000035", "143509", "ECU", "Ecuador"])
StringRecord(["\n1527066000035", "143510", "HND", "Honduras"])
StringRecord(["\n1527066000035", "143511", "JAM", "Jamaica"])
StringRecord(["\n1527066000035", "143512", "NIC", "Nicaragua"])
StringRecord(["\n1527066000035", "143513", "PRY", "Paraguay"])
StringRecord(["\n"])
This is what I'd expect. Your data is formatted in a weird way. In particular, you're seemingly using \x02\n as your record terminator, but you should just be using \x02. And when you go down that path, all of a sudden comments aren't going to be particularly helpful unless you have an editor specifically designed to handle \x01 and \x02 as field and record separators.
This is why people just stick to the standard delimiters. It's much easier.
I agree regarding the delimiter; unfortunately, my source for the CSV is Apple's Enterprise Partner Feed. I guess I could run the file through sed 's/\x02\n/\x02/g' before parsing.
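The same preprocessing step can also be done in Rust before handing the bytes to the parser. The following is a rough sketch using only the standard library; the function name `strip_newline_after_terminator` is made up for this example, and it assumes the whole input fits in memory.

```rust
/// Removes a single `\n` that immediately follows each `\x02` byte, so that
/// `\x02` alone terminates every record. Newlines elsewhere (for example,
/// inside field values) are left untouched.
fn strip_newline_after_terminator(input: &[u8]) -> Vec<u8> {
    const RECORD_END: u8 = 0x02;
    let mut out = Vec::with_capacity(input.len());
    let mut prev_was_record_end = false;
    for &b in input {
        if prev_was_record_end && b == b'\n' {
            // Drop the newline that is part of the two-byte terminator.
            prev_was_record_end = false;
            continue;
        }
        out.push(b);
        prev_was_record_end = b == RECORD_END;
    }
    out
}
```

The cleaned bytes can then be parsed with \x01 as the field delimiter and \x02 as the single-byte record terminator.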
I ran into a similar problem.
I got a CSV from a client where the separator is |, the line terminator is \r\n, and there's no escape character. Double quotes " can be used for fields containing \r\n. However, fields can contain an unquoted \n, since it is not a separator.
@givors-anabasis This crate can't handle that data then. You'll need to find some other way.
All right, too bad, thanks for the reply @BurntSushi
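For data shaped like the pipe-separated format described above, one possible "other way" is a small hand-rolled splitter. This is only a sketch under the stated assumptions (| as separator, \r\n as the only record terminator, double quotes that toggle quoting with no escape mechanism, quotes dropped from the output). The function name `split_pipe_records` is hypothetical, and this is not a substitute for a full CSV parser.

```rust
/// Splits `input` into records of fields, treating `|` as the field separator
/// and `\r\n` as the record terminator. Text between double quotes is taken
/// literally, and a bare `\n` outside quotes is ordinary field content.
fn split_pipe_records(input: &str) -> Vec<Vec<String>> {
    let mut records = Vec::new();
    let mut record: Vec<String> = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            // Quotes only toggle quoting; they are not kept in the field.
            '"' => in_quotes = !in_quotes,
            '|' if !in_quotes => record.push(std::mem::take(&mut field)),
            '\r' if !in_quotes => {
                if chars.peek() == Some(&'\n') {
                    // `\r\n` outside quotes ends the current record.
                    chars.next();
                    record.push(std::mem::take(&mut field));
                    records.push(std::mem::take(&mut record));
                } else {
                    // A lone `\r` is ordinary content.
                    field.push(c);
                }
            }
            _ => field.push(c),
        }
    }

    // Flush a final record that lacks a trailing `\r\n`.
    if !field.is_empty() || !record.is_empty() {
        record.push(field);
        records.push(record);
    }
    records
}
```

Given the input `a|b\nc|d\r\n`, this yields one record with the fields `a`, `b\nc`, and `d`, since the bare \n does not end the record.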