modin
BUG: Parallel CSV reader does not account for escaping quotes when determining lines for reading
The current approach to splitting the work determines where a line ends by assuming that a symbol is outside a field if the number of quotes to the left of it is even (so if a line-end symbol comes after, say, 4 quotes, it is not part of a field).
This assumption is wrong because a quote symbol can be escaped, and there are two common ways of doing it. One is to double the quote inside a quoted field, which keeps the quote count even and so works with the current logic. The other is to use a backslash followed by a double quote, which adds a single quote to the count and breaks the assumption.
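To make the failure concrete, here is a minimal sketch of the parity check described above. It is not Modin's actual code; `is_outside_field` is a hypothetical helper written only for illustration.

```python
def is_outside_field(data: bytes, pos: int, quotechar: bytes = b'"') -> bool:
    """Parity heuristic (hypothetical helper, not Modin's implementation):
    treat `pos` as outside a quoted field if an even number of quote
    characters precede it."""
    return data.count(quotechar, 0, pos) % 2 == 0


# Quote escaped by doubling: 6 quotes precede the newline, parity is even.
doubled = b'"a ""x"" b"\n'
# Quote escaped with a backslash: 3 quotes precede the newline, parity is odd.
escaped = b'"a \\"x b"\n'

doubled_ok = is_outside_field(doubled, doubled.index(b"\n"))
escaped_ok = is_outside_field(escaped, escaped.index(b"\n"))
```

In both inputs the newline really does end the record, but the heuristic only recognizes it for the doubled form; for the backslash form it concludes the newline is still inside a field.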
@vnlitvinov do you have a reproducer for this?
I don't, as the issue is quite old. IIRC it emerged from reviewing some pull request.
An example would be a line in a CSV file such as
"field a", "field b with a \" symbol", "field c"
and, in the unfortunate case that our split point lands in the middle of that line, everything would break.
That is, our quoting logic is based on the assumption that quotes in CSV are escaped by doubling them (i.e. their number is always even on a given line), but there can be cases where that number is odd.
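For what it's worth, one possible direction is a scan that skips whatever follows the escape character before toggling the quote state. The sketch below is only an illustration under assumptions: `find_record_ends` is a hypothetical name, it assumes single-byte quote, escape and newline characters, and it assumes the scan starts at a known record boundary, which a real parallel splitter would still have to establish.

```python
from typing import List


def find_record_ends(
    data: bytes,
    quotechar: bytes = b'"',
    escapechar: bytes = b"\\",
    newline: bytes = b"\n",
) -> List[int]:
    """Return offsets of newlines that terminate a record.

    Hypothetical helper: assumes `data` starts at a record boundary and
    that all three special characters are single bytes.
    """
    q, e, n = quotechar[0], escapechar[0], newline[0]
    ends = []
    in_quotes = False
    i = 0
    while i < len(data):
        c = data[i]
        if c == e:
            # Skip the escape character and the byte it escapes.
            i += 2
            continue
        if c == q:
            # Doubled quotes toggle twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif c == n and not in_quotes:
            ends.append(i)
        i += 1
    return ends


# The line from the comment above, followed by a record with an embedded newline.
line1 = b'"field a", "field b with a \\" symbol", "field c"'
data = line1 + b"\n" + b'"x", "y\nz"\n'
record_ends = find_record_ends(data)
```

Here `line1` contains 7 quote characters, so the parity check would misjudge its trailing newline, while the escape-aware scan reports it as a record end and ignores the newline inside `"y\nz"`.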