CSV.jl
CSV.read error with limit on multiple threads
This was run with 8 threads on a large file:
julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=1000), :eltype)
ERROR: TaskFailedException
nested task error: BoundsError: attempt to access 1000-element Vector{UInt32} at index [1001]
Stacktrace:
[1] setindex!
@ .\array.jl:966 [inlined]
[2] checkpooled!(#unused#::Type{Union{Missing, String31}}, pertaskcolumns::Vector{Vector{CSV.Column}}, col::CSV.Column, j::Int64, ntasks::Int64, nrows::Int64, ctx::CSV.Context)
@ CSV ~\.julia\packages\CSV\1P1tQ\src\file.jl:513
[3] multithreadpostparse(ctx::CSV.Context, ntasks::Int64, pertaskcolumns::Vector{Vector{CSV.Column}}, rows::Vector{Int64}, finalrows::Int64, j::Int64, col::CSV.Column)
@ CSV ~\.julia\packages\CSV\1P1tQ\src\file.jl:432
[4] macro expansion
@ ~\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384 [inlined]
[5] (::CSV.var"#31#36"{CSV.Context, Int64, Vector{Vector{CSV.Column}}, Vector{Int64}, Int64, Int64, CSV.Column})()
@ CSV .\threadingconstructs.jl:258
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base .\task.jl:436
[2] macro expansion
@ .\task.jl:455 [inlined]
[3] CSV.File(ctx::CSV.Context, chunking::Bool)
@ CSV ~\.julia\packages\CSV\1P1tQ\src\file.jl:281
[4] File
@ ~\.julia\packages\CSV\1P1tQ\src\file.jl:226 [inlined]
[5] #File#28
@ ~\.julia\packages\CSV\1P1tQ\src\file.jl:222 [inlined]
[6] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:limit,), Tuple{Int64}}})
@ ~\.julia\packages\CSV\1P1tQ\src\CSV.jl:117
[7] top-level scope
@ REPL[10]:1
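For anyone trying to reproduce this: the session above assumes Julia was started with multiple threads. A minimal setup sketch (the file name is from the report above; everything else is standard Julia/CSV.jl usage):

```julia
# Start Julia with 8 threads, e.g.:  julia -t 8
# (or set the environment variable JULIA_NUM_THREADS=8 before launching)
using CSV, DataFrames

Threads.nthreads()  # should report 8 for this reproduction

# Failing call from the report:
describe(CSV.read("instagram_locations.csv", DataFrame, limit=1000), :eltype)
```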
Bump, I'm seeing this bug too on v0.10.10. Any workarounds would be appreciated as well.
ntasks=1 works but it's slow.
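A sketch of that workaround, plus an untested alternative that avoids `limit` during parsing and truncates afterwards (file name taken from the original report; whether the alternative dodges the bug is an assumption, since the crash is specifically in the `limit` post-processing path):

```julia
using CSV, DataFrames

# Workaround from this thread: force single-threaded parsing.
# Correct, but slow on large files.
df = CSV.read("instagram_locations.csv", DataFrame; limit=1000, ntasks=1)

# Possible alternative (untested against this bug): parse without `limit`,
# then keep only the first 1000 rows. Reads the entire file, so it costs
# more time and memory, but keeps multithreaded parsing.
df2 = first(CSV.read("instagram_locations.csv", DataFrame), 1000)
```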
Can either of you try the latest main branch? We just merged a related fix.
No luck here. limit = 100_000 gives
nested task error: BoundsError: attempt to access 100000-element Vector{UInt32} at index [100001]
in the same place as shown in the OP.