CSV.jl
Threaded parsing type mismatch depending on `ntasks` value
Problem
We've run across a very odd issue where, depending on the value of `ntasks`, the parsed column type differs.
```julia
julia> CSV.File("foobar.csv"; ntasks=60, debug=true)
...
types after parsing: Type[Float64], pool = (0.2, 500)

julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
types after parsing: Type[String31], pool = (0.2, 500)
```
File to replicate the issue: foobar.csv
Additional fact from the original observed file: this column was the right-most column but there were several columns of various types including Float64, string, and Int64, and only this column had issues.
I think I know what's happening, but not what to do about it. When we chunk up the file in the `ntasks=120` case, we get unlucky and end up with the last two bytes of a chunk being a newline and a leading negation sign. It's a bit like trying to parse a file that looks like

```
1.0
2.0
-3.0
4.0
5.0
```

(i.e. `1.0\n2.0\n-3.0\n4.0\n5.0`) in two chunks: `1.0\n2.0\n-` and `3.0\n4.0\n5.0`. When we try to parse that first chunk as `Float64` values (as we do as part of `detect`), we parse `1.0`, then `2.0`, then try to parse `-` as a `Float64`, which fails. So `detect` decides on this basis that the column isn't `Float64`s after all (and falls back to parsing the column as a string).
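The failure mode above can be reproduced in miniature with `Base.tryparse`, no CSV.jl internals needed. `detect_float_column` below is a hypothetical helper for illustration, not CSV.jl's actual `detect`:

```julia
# Minimal sketch: naive per-chunk type detection on a chunk that was split
# mid-token, leaving a lone "-" as its last "value".
function detect_float_column(chunk::AbstractString)
    for token in split(chunk, '\n'; keepempty=false)
        # tryparse returns `nothing` when the token is not a valid Float64
        tryparse(Float64, token) === nothing && return false
    end
    return true
end

data = "1.0\n2.0\n-3.0\n4.0\n5.0"

# Split exactly after the '-' sign, as in the unlucky ntasks=120 case:
chunk1 = data[1:9]     # "1.0\n2.0\n-"
chunk2 = data[10:end]  # "3.0\n4.0\n5.0"

detect_float_column(data)    # true: every row parses as Float64
detect_float_column(chunk1)  # false: the trailing "-" fails to parse
```

On the whole file detection succeeds, but the chunk ending in `-` makes detection conclude the column is a string.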
Hmmmm... I don't think that should be possible, because we do extra work to ensure chunks only get split exactly on the newline character, so `\n` should always end a chunk and the next character should be the start of the next chunk.
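The boundary adjustment being described might look roughly like the following sketch. This is assumed logic for illustration, not CSV.jl's actual implementation, which also has to cope with quoted fields containing newlines:

```julia
# Move an estimated chunk boundary forward to the byte after the next
# newline, so each chunk starts at the beginning of a row.
function adjust_to_row_start(bytes::Vector{UInt8}, pos::Int)
    nl = findnext(==(UInt8('\n')), bytes, pos)
    # If no newline is found, the chunk runs to the end of the file
    return nl === nothing ? length(bytes) + 1 : nl + 1
end

bytes = Vector{UInt8}("1.0\n2.0\n-3.0\n4.0\n5.0\n")
adjust_to_row_start(bytes, 9)  # pos 9 is the '-'; moves to the start of "4.0"
```

If this adjustment works correctly, no chunk should ever begin mid-value, which is why the lone `-` is surprising.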
Hmmm, well, I'm very curious to find out what is going on 😂
The debug output is

```julia
julia> CSV.File("foobar.csv"; ntasks=120, debug=true)
header is: 1, skipto computed as: 2
headerpos = 1, datapos = 8
estimated rows: 3598
detected delimiter: ","
column names detected: [:foobar]
byte position of data computed at: 8
computed types are: nothing
initial byte positions before adjusting for start of rows: [8, 520, 1032, 1544, 2056, 2568, 3080, 3592, 4104, 4616, 5128, 5640, 6152, 6664, 7176, 7688, 8200, 8712, 9224, 9736, 10248, 10760, 11272, 11784, 12296, 12808, 13320, 13832, 14344, 14856, 15368, 15880, 16392, 16904, 17416, 17928, 18440, 18952, 19464, 19976, 20488, 21000, 21512, 22024, 22536, 23048, 23560, 24072, 24584, 25096, 25608, 26120, 26632, 27144, 27656, 28168, 28680, 29192, 29704, 30216, 30728, 31240, 31752, 32264, 32776, 33288, 33800, 34312, 34824, 35336, 35848, 36360, 36872, 37384, 37896, 38408, 38920, 39432, 39944, 40456, 40968, 41480, 41992, 42504, 43016, 43528, 44040, 44552, 45064, 45576, 46088, 46600, 47112, 47624, 48136, 48648, 49160, 49672, 50184, 50696, 51208, 51720, 52232, 52744, 53256, 53768, 54280, 54792, 55304, 55816, 56328, 56840, 57352, 57864, 58376, 58888, 59400, 59912, 60424, 60936, 61526]
something went wrong chunking up a file for multithreaded parsing, falling back to single-threaded parsing
time for initial parsing: 5.262593030929565
types after parsing: Type[String31], pool = (0.2, 500)
```
I think a few funny things are going on here:

- Multi-threaded parsing fails... and I'm curious why. I thought this usually only happened when we got unlucky with quoted columns, and there's no quoted data here... but maybe my understanding is wrong and multi-threaded parsing is known to fail in other cases.
- Multi-threaded parsing detects `String31` (and fails). I'm not sure if this is the same as the point above or not, i.e. whether the incorrect type detection and the failure are one and the same thing. I think what happens is we somehow get `detect` being passed `pos == len`, and we're trying to parse a single character that happens to be the `-` character (which is then why `detect` chooses a string type)... but how we get here I'm not sure.
- I think we're parsing things as a `String31` because this is what is detected by multi-threaded parsing... but multi-threaded parsing fails, and yet we still use the detected `String31` type for single-threaded parsing. Should we reset the column types (to `NeedsTypeDetection`, for columns where the type wasn't user-given) if multi-threaded parsing fails, so that "falling back to single-threaded parsing" really is the same as `ntasks=1`?
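The reset being suggested might look something like the sketch below. The `Column` struct and `NeedsTypeDetection` sentinel here are simplified stand-ins for CSV.jl's internals, not its actual API:

```julia
# Hypothetical simplified column state for illustration only.
mutable struct Column
    type::Type
    usergiven::Bool  # true if the user passed an explicit type for this column
end

# Sentinel meaning "re-run type detection during single-threaded parsing"
struct NeedsTypeDetection end

# When multithreaded parsing fails, discard any types it detected so the
# single-threaded fallback behaves the same as `ntasks=1` from the start.
function reset_detected_types!(columns::Vector{Column})
    for col in columns
        col.usergiven || (col.type = NeedsTypeDetection)
    end
    return columns
end

cols = [Column(String, false), Column(Int64, true)]
reset_detected_types!(cols)
cols[1].type  # NeedsTypeDetection: detected type is discarded
cols[2].type  # Int64: user-given type is kept
```

The key design point is that only detected types are discarded; anything the user specified explicitly survives the fallback.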
Multi-threading fails here; I'm slowly trying to walk through and figure out what's going on, but it's quite difficult and overwhelming to understand.
I'd be curious to know the values here and why that check failed, especially on the full file where it seems like we should have enough columns to get a good % probability of finding the right row endings.
It sounds like @nickrobinson251 is probably right that we're not resetting things correctly when multithreaded parsing fails, so we're "stuck" with potentially bad types.