
Bug when parsing complex CSV with multi-threading enabled

Open bkamins opened this issue 2 years ago • 3 comments

The problem is described in https://discourse.julialang.org/t/error-task-failed-exception-reading-csv/86544. Most likely the cause of the problem is that the file has multi-line fields that are wrapped in ".

bkamins avatar Aug 30 '22 12:08 bkamins

The dataset includes a long text in which the " character is escaped by a \. Using escapechar='\\' in the options solves this issue, so I think it's not a bug, maybe just a discoverability issue with that option?

Liozou avatar Feb 12 '23 13:02 Liozou

I have checked this and using escapechar='\\' does not solve the issue. Also note that the file is read correctly in single-threaded mode, i.e. when using df = CSV.read("DataEngineer.csv", DataFrame, ntasks=1).

I have also checked that there is indeed an issue with embedded " characters. An example of such a situation is:

"Job Description
<here I cut out irrelevant multi-line input>
applicable state and local \""Fair Chance\"" laws."

and setting escapechar='\\' leads to errors. The default setting escapechar='"' seems correct, as the field then gets parsed as local \"Fair Chance\" laws, which is maybe not ideal, but at least correctly respects the field delimiter.
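
For anyone who wants to check the escaping behaviour without the original file, here is a minimal sketch. The two-column inline CSV and the column names id and desc are made up for illustration; only the \"" pattern and the multi-line quoted field are taken from the example above. It reads the data single-threaded with the default quotechar and escapechar, so it does not reproduce the multi-threading failure itself.

```julia
using CSV

# Hypothetical miniature of the problematic field: a quoted, multi-line value
# in which each embedded quote is written as \"" (a backslash followed by a
# doubled quote), as in the "Fair Chance" excerpt above.
data = "id,desc\n1,\"first line\nlocal \\\"\"Fair Chance\\\"\" laws.\"\n"

# Defaults: quotechar='"' and escapechar='"', so "" inside a quoted field
# stands for one literal quote and the backslash is kept as ordinary text.
# ntasks=1 forces single-threaded parsing.
f = CSV.File(IOBuffer(data); ntasks=1, stringtype=String)

println(length(f))   # number of parsed rows
println(f.desc[1])   # the multi-line field, with doubled quotes collapsed
```

If the defaults behave as described above, this should give a single row whose desc value still contains the backslashes before each quote.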

bkamins avatar Feb 12 '23 14:02 bkamins

My bad, you are absolutely right, escapechar='\\' is simply wrong here. Apologies for the noise!

Liozou avatar Feb 12 '23 19:02 Liozou