CSV.jl

Parsing fails with long strings

Open CarlColglazier opened this issue 3 years ago • 7 comments

Replication

Using this test file, saved as "test.csv":

test.csv

Run the following to try to read it:

using CSV, DataFrames
CSV.read("test.csv", DataFrame, ntasks=1)
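The attached file is not reproduced in this thread. As a rough stand-in, the sketch below writes a CSV with one very long cell, assuming the original contained a single cell of roughly 1.1 million characters (which is what the length 1100002 in the error suggests). The helper name `write_long_cell_csv` and its `cell_length` keyword are made up for illustration and are not part of CSV.jl.

```julia
# Hypothetical helper (not part of CSV.jl): writes a two-column CSV whose
# single data row contains one very long, unquoted cell.
function write_long_cell_csv(path::AbstractString; cell_length::Integer = 1_100_000)
    open(path, "w") do io
        println(io, "id,text")          # header row
        print(io, "1,")                 # first column of the data row
        print(io, "a"^cell_length)      # the oversized cell
        println(io)
    end
    return path
end

# Write into a temporary directory so nothing in the working directory is touched.
path = write_long_cell_csv(joinpath(mktempdir(), "test.csv"))
println(filesize(path), " bytes written to ", path)
```

The cell here is longer than the 1048575 limit quoted in the error, so reading this file with the call above should be a reasonable way to check whether a given CSV.jl version is affected.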

Which gives the following error:

ERROR: ArgumentError: length argument to Parsers.PosLen (1100002) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/strings.jl:289
  [4] xparse
    @ ~/.julia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
  [5] detectcell(buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context, rowsguess::Int64)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:739
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:598 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:ntasks,), Tuple{Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[4]:1
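A note on the number in the error: 1048575 is exactly 2^20 - 1, which is consistent with the length being stored in a 20-bit field. That is an inference from the message, not something checked against the Parsers.jl source here. The arithmetic, as a small sketch:

```julia
# Values copied from the error message above.
max_len      = 1_048_575    # "max length allowed"
reported_len = 1_100_002    # "length argument to Parsers.PosLen"

println(max_len == 2^20 - 1)       # the limit is the largest 20-bit value
println(reported_len > max_len)    # the cell exceeds that limit
println(reported_len - max_len)    # by this many bytes
```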

CarlColglazier, Jun 20 '22 15:06

Yeah, this is a tricky one. There is some discussion about it here: https://github.com/JuliaData/CSV.jl/issues/935 and https://github.com/JuliaData/Parsers.jl/pull/98

nickrobinson251, Jun 20 '22 15:06

Ugh, this is really bad and it happens even without enormously long lines, just big files...

Here's the Census 2020 ACS household data:

https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip

Unzip it and try to read the second large file:

df = CSV.read("psam_husb.csv", DataFrame)

You'll get one of these parse errors. Lines in this file are a few thousand characters long, not hundreds of thousands. But there are 645744 lines in the file.

Is there a workaround here?

dlakelan, Aug 02 '22 05:08

@dlakelan, it sounds to me like there might be some bad quoting in your file. The limits at which you would hit this bug are:

  • Greater than ~1MB for an individual cell value
  • Greater than ~4.4TB for entire file size

However, if a cell started with "some text ... but had no terminating " character, parsing would continue until EOF looking for the closing ". Everything from the opening quote onward would then count as a single cell, which could easily exceed the ~1MB limit.
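One rough way to look for such a cell is to scan the file for lines with an odd number of " characters. The sketch below uses only Base Julia. The helper name `lines_with_odd_quotes` is made up for illustration. It assumes no quoted cell legitimately contains a newline, since those would be flagged too.

```julia
# Hypothetical helper (not part of CSV.jl): returns the 1-based numbers of
# lines that contain an odd number of double-quote characters.
# Escaped quotes written as "" add two per pair, so they do not change parity.
function lines_with_odd_quotes(path::AbstractString)
    suspects = Int[]
    for (lineno, line) in enumerate(eachline(path))
        nquotes = count(==('"'), line)
        isodd(nquotes) && push!(suspects, lineno)
    end
    return suspects
end

# Small demonstration file: line 3 opens a quote and never closes it.
demo = joinpath(mktempdir(), "demo.csv")
write(demo, "a,b\n1,\"ok\"\n2,\"unterminated\n3,x\n")
println(lines_with_odd_quotes(demo))
```

Any line this reports is only a candidate. It is worth opening the file at that line to confirm before editing or re-downloading anything.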

quinnj, Aug 02 '22 05:08

FWIW I can't reproduce it on that file:

julia> CSV.read(f, DataFrame)
647968×239 DataFrame
    Row │ RT       SERIALNO       DIVISION  PUMA   REGION  ST     ADJHSG   ADJINC   WGTP   NP     TYPEHUGQ  ACCESSINET  ACR      AGS      BATH     BDSP     BLD      BROADBND  COMPOTHX  CONP     DIALUP   ELEFP  ⋯
        │ String1  String15       Int64     Int64  Int64   Int64  Int64    Int64    Int64  Int64  Int64     Int64?      Int64?   Int64?   Int64?   Int64?   Int64?   Int64?    Int64?    Int64?   Int64?   Int64? ⋯
────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │ H        2020GQ0000022         4    400       2     29  1000000  1006149      0      1         2     missing  missing  missing  missing  missing  missing   missing   missing  missing  missing  missin ⋯
      2 │ H        2020GQ0000086
(...)

(jl_o4iDqb) pkg> st
Status `C:\Users\ngudat\AppData\Local\Temp\jl_o4iDqb\Project.toml`
  [336ed68f] CSV v0.10.4
  [a93c6f00] DataFrames v1.3.4

nilshg, Aug 02 '22 07:08

I also cannot reproduce this with that file on a Mac, using the same versions as the Windows test above.

jd-foster, Aug 02 '22 09:08

Sure enough, on line 14 my version of the file has a quote character at the end:

image

I'll re-download the zip file and uncompress it from scratch to see if it was just damage from the download.

dlakelan, Aug 02 '22 14:08

OK, sure enough, with a fresh download the file loads... Computers are weird. Thanks, you guys, for helping with this!

dlakelan, Aug 02 '22 14:08

Glad it's sorted!

There's #935 open for the "really long strings" issue, so will close this one.

nickrobinson251, Oct 07 '22 11:10