
BUG: Parallel CSV reader does not account for escaping quotes when determining lines for reading

Open vnlitvinov opened this issue 4 years ago • 2 comments

The current approach to splitting the work determines where a line ends by assuming that a character lies outside a quoted field whenever the number of quote characters to its left is even (so a line terminator that follows, say, 4 quotes is not part of a field).

This assumption is wrong because a quote character can be escaped, and there are two common ways of doing so. One is to write two double quotes inside a quoted field, which keeps the count even and works nicely; the other is to write a backslash followed by a double quote, which adds a single quote to the count and breaks the assumption.
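A minimal sketch of the parity heuristic described above, to show where it holds and where it fails. The helper name `quote_count_is_even` is hypothetical and is not taken from Modin's code; it only mirrors the "even number of quotes to the left" rule.

```python
def quote_count_is_even(text_before: str, quotechar: str = '"') -> bool:
    """Parity heuristic: treat the current position as outside a quoted
    field when the number of quote characters before it is even."""
    return text_before.count(quotechar) % 2 == 0


# Escaping by doubling: every escaped quote adds two characters,
# so the parity is preserved and the heuristic gives the right answer.
doubled = '"a ""b"" c",x'

# Escaping with a backslash: the escaped quote adds one character,
# so the parity flips and the heuristic gives the wrong answer.
backslash = '"a \\" b",x'

doubled_ok = quote_count_is_even(doubled)
backslash_ok = quote_count_is_even(backslash)
```

In both strings the end of the text is outside any quoted field, yet only the first one passes the check.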

vnlitvinov avatar Feb 09 '21 23:02 vnlitvinov

@vnlitvinov do you have a reproducer for this?

pyrito avatar Aug 23 '22 17:08 pyrito

I don't, as the issue is quite old. IIRC it emerged from reviewing some pull request.

An example would be a line in a CSV file such as:

"field a", "field b with a \" symbol", "field c"

and, in the unfortunate case that our split lands in the middle of that line, everything would break.

That is, our quoting logic is based on the assumption that quotes in CSV are escaped by doubling them (i.e. their number is always even on a given line), but with backslash escaping their number can be odd.
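To make this concrete, here is a rough sketch that runs the example line through an escape-aware scan and through Python's standard csv module. The function `ends_outside_quotes` is a hypothetical illustration of one possible fix, assuming a backslash escape character; it is not Modin's implementation.

```python
import csv
import io

line = '"field a", "field b with a \\" symbol", "field c"\n'

# The raw quote count is odd, so a parity check would wrongly conclude
# that the trailing newline sits inside a quoted field.
raw_quote_count = line.count('"')


def ends_outside_quotes(text: str, quotechar: str = '"',
                        escapechar: str = "\\") -> bool:
    """Scan the text, skipping any character that follows the escape
    character, and report whether the scan ends outside a quoted field."""
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escapechar and i + 1 < len(text):
            i += 2  # skip the escape character and the character it escapes
            continue
        if ch == quotechar:
            in_quotes = not in_quotes
        i += 1
    return not in_quotes


outside = ends_outside_quotes(line)

# The standard library parser, told about the escape character,
# reads the line as a single record with three fields.
rows = list(csv.reader(io.StringIO(line), escapechar="\\",
                       doublequote=False, skipinitialspace=True))
```

Doubled quotes still work with this scan because two consecutive quotes toggle the state twice and leave it unchanged.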

vnlitvinov avatar Aug 24 '22 08:08 vnlitvinov