Parquet.jl
Serious performance issues with read
See the repo and the code below.
https://github.com/xiaodaigh/parquet-data-collection/blob/master/test_files.jl
To replicate, just clone the repo and run the above file (included in the repo).
It takes a LONG time.
I suspect it is that the RecordCursor loops through each row, which is highly inefficient, especially considering Parquet is a columnar format.
I think the lowest level of cursor/iteration should be at the column-chunk level, not the row level. The row-level cursor should live in some utility library like ParquetFiles.jl to keep this package clean.
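For illustration, the difference between the two iteration strategies might look like this. ParFile and RecordCursor are the existing API as of this thread; `column_chunks` and `decode` are hypothetical names for the proposed column-chunk-level iteration, not real functions:

```julia
using Parquet

p = ParFile("dsd50p.parquet")

# Current design: the cursor reassembles one full record per iteration,
# touching every column once for every row.
for row in RecordCursor(p)
    # per-row work happens here
end

# Proposed design (hypothetical API): iterate column chunks instead, so
# each column is decoded in one contiguous pass.
# for chunk in column_chunks(p)
#     col = decode(chunk)   # one vector per chunk, no per-row overhead
# end
```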
Data does get read in chunks internally.
Could you post some benchmark results comparing times of Parquet.jl master to the last release?
> Could you post some benchmark results comparing times of Parquet.jl master to the last release?
I mean to say, it's dead slow on both master and the latest release.
I was able to trace some issues; PR here: https://github.com/JuliaIO/Parquet.jl/pull/77. But more can be done.
The issue seems to be much deeper, in the reader design. With the fork below I can read the file in 3~4 seconds, but with Parquet.jl #77 it takes 130s.
See the fork here https://github.com/xiaodaigh/Parquet.jl/blob/master/README.md
I tried the fork, following the readme pointed to above. Not sure if I got it right, but read_parquet seems to return a single row in 13 sec.
```julia
julia> using Parquet

julia> @time df = read_parquet("dsd50p.parquet");
Progress: 100%|████████████████████████████████████████| Time: 0:00:07
 13.827823 seconds (60.12 M allocations: 2.331 GiB, 4.70% gc time)

julia> isa(df, NamedTuple)
true
```
but read_parquet seems to return a single row in 13 sec.
Try this:

```julia
df.tile_count
```

and you will see that it returns the whole column, not just one row. Each name is attached to a vector, not a single value.
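For illustration (made-up values, not the actual file's contents), the returned shape is a NamedTuple of column vectors:

```julia
# Illustrative only: each field of the NamedTuple is a whole column.
df = (tile_count = [3, 1, 4], name = ["a", "b", "c"])

df.tile_count         # the full column: [3, 1, 4]
first(df.tile_count)  # a single row's value for that column: 3
```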
As to speed, it's more likely due to PC differences? I consistently get 3s on my 6-core machine; I also have 2 SSDs in RAID 0.
Update
BTW, 3s should be considered really slow, as R's arrow can read it and summarise it in about ~0.13s. So lots of room to improve.
Okay, I see the column now. But it looks like you are disregarding the nested schema? So this will work only for non-nested data?
And you seem to be using the same underlying page and column-chunk read methods of Parquet.jl, right? So the current performance issues are in the layer above that.
> So this will work only for non-nested data?
That is correct. And that is the issue bothering me: I don't understand the nestedness, so I actually don't know how to deal with it.
Hopefully, there are things in my approach that can be reused to handle nested schemas?
Honestly, the reason I am not motivated to do the nested schema thing is that I have never needed it or used it.
Would it be acceptable to have a set of functions just to read non-nested data?
> So the current performance issues are in the layer above that.
Yes, I think so. Although there's definitely much room to improve in the values method, e.g. it uses push!, which causes much more allocation than necessary for unnested column reads (not sure about nested columns, so I'll keep my comment to unnested ones).
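A minimal sketch of the allocation difference, in generic Julia (this is not the actual values code); since the number of rows in a column chunk is known from the file metadata, the output can be allocated once up front:

```julia
# Generic sketch, not the actual `values` implementation.
function collect_push(n)
    out = Int64[]
    for i in 1:n
        push!(out, i)              # vector regrows repeatedly as it fills
    end
    return out
end

function collect_prealloc(n)
    out = Vector{Int64}(undef, n)  # one allocation; n comes from metadata
    for i in 1:n
        @inbounds out[i] = i
    end
    return out
end
```

Even just calling sizehint!(out, n) before the push! loop would avoid most of the regrowth.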
> Would it be acceptable to have a set of functions just to read non-nested data?
I guess. But I would try to figure out why we cannot handle nestedness and be performant as well.
> why we cannot handle nestedness and be performant as well
I don't see a reason why not. The issue is that I don't have the knowledge to contribute performant code for nested data. Also, I have no motivation there, since I never use nested columns.
So I only see 3 options:

1. we try a PR with just unnested columns (which I think is the majority of use cases out there)
2. someone other than me writes performant code to handle nested and unnested columns
3. do nothing
I would prefer 1 over 3, but I can only contribute to 1 to prevent 3, not to 2. If 2 can happen, it would be great! But if 2 takes a year, then I would rather see 1 than 3 as the eventuality. I can commit to doing 1.
> I don't understand the nestedness, so I actually don't know how to deal with it.
This is basically what the definition levels and repetition levels allow. It is based on Dremel; there should be some material out there if you search for it.
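A toy example of the encoding (generic Dremel semantics, not Parquet.jl's API; the vectors are illustrative):

```julia
# Schema: message doc { repeated group a { optional int64 b } }
# For the path a.b: max repetition level = 1, max definition level = 2.
#
# Record 1: a = [(b = 1), (b = 2)]
# Record 2: a = []   (the repeated group is empty)
#
# The column for a.b flattens to three entries:
vals       = [1, 2, missing]
def_levels = [2, 2, 0]  # 2 = b present; 0 = nothing on the path is defined
rep_levels = [0, 1, 0]  # 0 = starts a new record; 1 = repeats within `a`
# Reassembly walks the level vectors to rebuild the nesting per record.
```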
> there should be some material out there if you search for it
Yeah, I am not motivated, that's the issue, since I have never needed it. If you are too busy or also not motivated, do you think we can ask NumFOCUS for funding? I would be happy to read the Dremel paper and implement it for some monetary compensation.
If we get funding, do you want to split it? Happy for it to be a solo effort on either side too. Happy either way.
> we try a PR with just unnested columns (which I think is the majority of use cases out there)
I am open to this. Maybe we should keep it in a sub-module to keep the interface clean. The sub-module can implement the simpler non-nested read and write using the other lower-level methods.
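Perhaps something like this layout (a sketch only; `Simple`, the stub ParFile, and the read_parquet body are placeholders, not the eventual API):

```julia
# Layout sketch only; names and bodies are placeholders.
module Parquet

struct ParFile              # stands in for the existing low-level reader
    path::String
end

# ... existing cursor-based reader, page and column-chunk methods ...

module Simple
    import ..ParFile        # reuse the parent module's low-level machinery
    export read_parquet

    "Read a non-nested file into a NamedTuple of column vectors."
    function read_parquet(path::AbstractString)
        p = ParFile(path)
        # decode each column chunk directly, bypassing the row cursor
        return NamedTuple() # body elided in this sketch
    end
end

end # module Parquet
```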
> I am open to this. Maybe we should keep it in a sub-module to keep the interface clean.
OK. It will be quite a release, with the writer and column-based read. Or do you want to finish off the writer PR first?
Update
See WIP PR #79
> Maybe we should keep it in a sub-module
Any ideas for the sub-module name? UnnestedColumnReader? It might be painful when the reader is fixed to deal with nested columns too.
Just wanted to chime in real quick to say that it'd be great if we could improve the read performance! The difference between the Julia Parquet reader and fastparquet (via PyCall) was like night and day for me. But I totally understand if there's no funding etc.