CSV.jl
Parsing fails with long strings
Replication
Using this test file saved as "test.csv"
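(The attached test file isn't reproduced here. As a stand-in, a file with the same failure characteristics can be written along these lines, assuming, based on the error below, that the trigger is a single cell longer than Parsers.jl's ~1 MB PosLen length limit; the 1_100_002 just matches the length reported in the error:)

# Hypothetical stand-in for the attached test.csv: a header row plus one row
# whose second cell is longer than Parsers.PosLen's 1_048_575-byte cap.
open("test.csv", "w") do io
    println(io, "a,b")
    println(io, "x,", "y"^1_100_002)
end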
Run the following to try to read it:
CSV.read("test.csv", DataFrame, ntasks=1)
Which gives the following error:
ERROR: ArgumentError: length argument to Parsers.PosLen (1100002) is too large; max length allowed is 1048575
Stacktrace:
[1] lentoolarge(len::Int64)
@ Parsers ~/.julia/packages/Parsers/KmPKe/src/utils.jl:302
[2] PosLen
@ ~/.julia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
[3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
@ Parsers ~/.julia/packages/Parsers/KmPKe/src/strings.jl:289
[4] xparse
@ ~/.julia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
[5] detectcell(buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context, rowsguess::Int64)
@ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:739
[6] parserow
@ ~/.julia/packages/CSV/jFiCn/src/file.jl:598 [inlined]
[7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
@ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
[8] CSV.File(ctx::CSV.Context, chunking::Bool)
@ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
[9] File
@ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
[10] #File#25
@ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
[11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:ntasks,), Tuple{Int64}}})
@ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
[12] top-level scope
@ REPL[4]:1
Yeah, this is a tricky one -- there's some discussion about it here: https://github.com/JuliaData/CSV.jl/issues/935 and https://github.com/JuliaData/Parsers.jl/pull/98
Ugh, this is really bad and it happens even without enormously long lines, just big files...
Here's the Census 2020 ACS household data
https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip
unzip it and try to read the second large file:
df = CSV.read("psam_husb.csv", DataFrame)
You'll get one of these parse errors. Lines in this file are only a few thousand characters long, not hundreds of thousands, but there are 645744 lines in the file.
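(For reference, the reproduction described above as a single script might look like this; the URL and file name are taken from the post, and a system unzip on PATH is assumed:)

using Downloads, CSV, DataFrames

# Download and extract the ACS 2020 household PUMS archive, then read the
# second, larger CSV it contains.
zip_path = Downloads.download(
    "https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip",
    "csv_hus.zip")
run(`unzip -o $zip_path`)   # assumes a system unzip is available
df = CSV.read("psam_husb.csv", DataFrame)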
Is there a workaround here?
@dlakelan, it sounds to me like there might be some bad quoting in your file. The thresholds at which you would hit this bug are:
- Greater than ~1MB for an individual cell value
- Greater than ~4.4TB for entire file size
If, however, a cell starts with "some text ... but there is no terminating " character, then parsing will continue all the way to EOF looking for the closing ", so the rest of the file gets treated as a single, very long cell.
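(As an illustration of that failure mode, not a file from this thread: a CSV with one unterminated quote followed by more than ~1 MB of otherwise ordinary rows should hit the same error, even though no single line is long.)

using CSV, DataFrames

# Hypothetical example of the unterminated-quote failure mode: the opening
# quote on the first data row is never closed, so the parser keeps scanning
# for the closing quote and the "cell" grows past PosLen's ~1 MB limit.
open("badquote.csv", "w") do io
    println(io, "a,b")
    println(io, "x,\"oops")              # opening quote, never closed
    for i in 1:200_000                   # a few MB of short, ordinary rows
        println(io, "row$i,value$i")
    end
end
CSV.read("badquote.csv", DataFrame, ntasks=1)   # should raise the PosLen "too large" error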
FWIW I can't reproduce it on that file:
julia> CSV.read(f, DataFrame)
647968×239 DataFrame
Row │ RT SERIALNO DIVISION PUMA REGION ST ADJHSG ADJINC WGTP NP TYPEHUGQ ACCESSINET ACR AGS BATH BDSP BLD BROADBND COMPOTHX CONP DIALUP ELEFP ⋯
│ String1 String15 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64? Int64? Int64? Int64? Int64? Int64? Int64? Int64? Int64? Int64? Int64? ⋯
────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ H 2020GQ0000022 4 400 2 29 1000000 1006149 0 1 2 missing missing missing missing missing missing missing missing missing missing missin ⋯
2 │ H 2020GQ0000086
(...)
(jl_o4iDqb) pkg> st
Status `C:\Users\ngudat\AppData\Local\Temp\jl_o4iDqb\Project.toml`
[336ed68f] CSV v0.10.4
[a93c6f00] DataFrames v1.3.4
I also cannot reproduce this on that file on mac, same versions as the Windows test above.
Sure enough, on line 14 my version of the file has a stray quote character at the end.

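(Not from the original thread, but a quick way to locate lines with an unbalanced quote like that one is a scan along these lines:)

# Rough heuristic: flag lines containing an odd number of '"' characters,
# which usually indicates an unterminated quoted field.
for (i, line) in enumerate(eachline("psam_husb.csv"))
    isodd(count(==('"'), line)) && println("line $i has an unbalanced quote")
end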
I'll re-download the zip file and uncompress it from scratch to see if it was just damage in the download.
Ok, sure enough, with a fresh download the file loads... Computers are weird. Thanks to you guys for helping with this!
Glad it's sorted!
There's #935 open for the "really long strings" issue, so I'll close this one.