CSV.jl
Threaded parsing type mismatch depending on `ntasks` value
Problem
We've run across a very odd issue where, depending on the value of `ntasks`, the parsed column type differs.
```julia
julia> CSV.File("foobar.csv"; ntasks=60, debug=true)
...
types after parsing: Type[Float64], pool = (0.2, 500)

julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
types after parsing: Type[String31], pool = (0.2, 500)
```
File to replicate the issue: foobar.csv
Additional fact from the original observed file: this column was the right-most column but there were several columns of various types including Float64, string, and Int64, and only this column had issues.
I think I know what's happening, but not what to do about it. When we chunk up the file in the `ntasks=120` case, we get unlucky and end up with the last two bytes of a chunk being a newline and a leading negation sign. It's a bit like trying to parse a file that looks like

```
1.0
2.0
-3.0
4.0
5.0
```

(i.e. `1.0\n2.0\n-3.0\n4.0\n5.0`) in two chunks: `1.0\n2.0\n-` and `3.0\n4.0\n5.0`. When we try to parse that first chunk as `Float64` values (as we do as part of `detect`), we parse `1.0`, then `2.0`, then try to parse `-` as a `Float64`, which fails. So `detect` decides on this basis that the column isn't `Float64`s after all (and falls back to parsing the column as a string).
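The failure mode above can be reproduced in miniature with `Base.tryparse`, no CSV.jl internals needed. `detect_float_column` below is a hypothetical helper for illustration, not CSV.jl's actual `detect`:

```julia
# Minimal sketch: naive per-chunk type detection on a chunk that was split
# mid-token, leaving a lone "-" as its last "value".
function detect_float_column(chunk::AbstractString)
    for token in split(chunk, '\n'; keepempty=false)
        # tryparse returns `nothing` when the token is not a valid Float64
        tryparse(Float64, token) === nothing && return false
    end
    return true
end

data = "1.0\n2.0\n-3.0\n4.0\n5.0"

# Split exactly after the '-' sign, as in the unlucky ntasks=120 case:
chunk1 = data[1:9]     # "1.0\n2.0\n-"
chunk2 = data[10:end]  # "3.0\n4.0\n5.0"

detect_float_column(data)    # true: every row parses as Float64
detect_float_column(chunk1)  # false: the trailing "-" fails to parse
```

On the whole file detection succeeds, but the chunk ending in `-` makes detection conclude the column is a string.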
Hmmmm... I don't think that should be possible, because we do extra work to ensure chunks only get split exactly on the newline character, so `\n` should always end a chunk and the next character should be the start of the next chunk.
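The boundary adjustment being described might look roughly like the following sketch. This is assumed logic for illustration, not CSV.jl's actual implementation, which also has to cope with quoted fields containing newlines:

```julia
# Move an estimated chunk boundary forward to the byte after the next
# newline, so each chunk starts at the beginning of a row.
function adjust_to_row_start(bytes::Vector{UInt8}, pos::Int)
    nl = findnext(==(UInt8('\n')), bytes, pos)
    # If no newline is found, the chunk runs to the end of the file
    return nl === nothing ? length(bytes) + 1 : nl + 1
end

bytes = Vector{UInt8}("1.0\n2.0\n-3.0\n4.0\n5.0\n")
adjust_to_row_start(bytes, 9)  # pos 9 is the '-'; moves to the start of "4.0"
```

If this adjustment works correctly, no chunk should ever begin mid-value, which is why the lone `-` is surprising.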
Hmmm, well, I'm very curious to find out what is going on 😂
The debug output is

```julia
julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
header is: 1, skipto computed as: 2
headerpos = 1, datapos = 8
estimated rows: 3598
detected delimiter: ","
column names detected: [:foobar]
byte position of data computed at: 8
computed types are: nothing
initial byte positions before adjusting for start of rows: [8, 520, 1032, 1544, 2056, 2568, 3080, 3592, 4104, 4616, 5128, 5640, 6152, 6664, 7176, 7688, 8200, 8712, 9224, 9736, 10248, 10760, 11272, 11784, 12296, 12808, 13320, 13832, 14344, 14856, 15368, 15880, 16392, 16904, 17416, 17928, 18440, 18952, 19464, 19976, 20488, 21000, 21512, 22024, 22536, 23048, 23560, 24072, 24584, 25096, 25608, 26120, 26632, 27144, 27656, 28168, 28680, 29192, 29704, 30216, 30728, 31240, 31752, 32264, 32776, 33288, 33800, 34312, 34824, 35336, 35848, 36360, 36872, 37384, 37896, 38408, 38920, 39432, 39944, 40456, 40968, 41480, 41992, 42504, 43016, 43528, 44040, 44552, 45064, 45576, 46088, 46600, 47112, 47624, 48136, 48648, 49160, 49672, 50184, 50696, 51208, 51720, 52232, 52744, 53256, 53768, 54280, 54792, 55304, 55816, 56328, 56840, 57352, 57864, 58376, 58888, 59400, 59912, 60424, 60936, 61526]
something went wrong chunking up a file for multithreaded parsing, falling back to single-threaded parsing
time for initial parsing: 5.262593030929565
types after parsing: Type[String31], pool = (0.2, 500)
```
I think a few funny things are going on here:

- Multi-threaded parsing fails... and I'm curious why. I thought this usually only happened when we got unlucky with quoted columns, and there's no quoted data here... but maybe my understanding is wrong and multi-threaded parsing is known to fail in other cases.
- Multi-threaded parsing detects `String31` (and fails). I'm not sure if this is the same as the point above or not, i.e. whether the incorrect type detection and the failure are one and the same thing. I think what happens is we somehow get `detect` being passed `pos == len`, and we're trying to parse a single character that happens to be the `-` character (which is then why `detect` chooses a string type)... but how we get here I'm not sure.
- I think we're parsing things as a `String31` because this is what is detected by multi-threaded parsing... but multi-threaded parsing fails, and yet we still use the detected `String31` type for single-threaded parsing. Should we reset the column types (to `NeedsTypeDetection`, for columns where the type wasn't user-given) if multi-threaded parsing fails, so that "falling back to single-threaded parsing" really is the same as `ntasks=1`?
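The reset being suggested might look something like the sketch below. The `Column` struct and `NeedsTypeDetection` sentinel here are simplified stand-ins for CSV.jl's internals, not its actual API:

```julia
# Hypothetical simplified column state for illustration only.
mutable struct Column
    type::Type
    usergiven::Bool  # true if the user passed an explicit type for this column
end

# Sentinel meaning "re-run type detection during single-threaded parsing"
struct NeedsTypeDetection end

# When multithreaded parsing fails, discard any types it detected so the
# single-threaded fallback behaves the same as `ntasks=1` from the start.
function reset_detected_types!(columns::Vector{Column})
    for col in columns
        col.usergiven || (col.type = NeedsTypeDetection)
    end
    return columns
end

cols = [Column(String, false), Column(Int64, true)]
reset_detected_types!(cols)
cols[1].type  # NeedsTypeDetection: detected type is discarded
cols[2].type  # Int64: user-given type is kept
```

The key design point is that only detected types are discarded; anything the user specified explicitly survives the fallback.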
Multi-threading fails here; I'm slowly trying to walk through and figure out what's going on, but it's quite difficult and overwhelming to understand.
I'd be curious to know the values here and why that check failed, especially on the full file where it seems like we should have enough columns to get a good % probability of finding the right row endings.
It sounds like @nickrobinson251 is probably right that we're not resetting things correctly when multithreaded parsing fails, so we're "stuck" with potentially bad types.