Parquet.jl
Serious performance issues with read
See the repo and the code below.
https://github.com/xiaodaigh/parquet-data-collection/blob/master/test_files.jl
To replicate, just clone the repo and run the above file (included in the repo).
It takes a LONG time.
I suspect it is that the RecordCursor loops through each row, which is highly inefficient, especially considering Parquet is a columnar format.
I think the lowest level of cursor/iteration should be at the column-chunk level, not the row level. The row-level cursor should live in some utility library like ParquetFiles.jl to keep this package clean.
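For illustration, the difference between the two iteration strategies might look like this. ParFile and RecordCursor are the existing API as of this thread; `column_chunks` and `decode` are hypothetical names for the proposed column-chunk-level iteration, not real functions:

```julia
using Parquet

p = ParFile("dsd50p.parquet")

# Current design: the cursor reassembles one full record per iteration,
# touching every column once for every row.
for row in RecordCursor(p)
    # per-row work happens here
end

# Proposed design (hypothetical API): iterate column chunks instead, so
# each column is decoded in one contiguous pass.
# for chunk in column_chunks(p)
#     col = decode(chunk)   # one vector per chunk, no per-row overhead
# end
```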
Data does get read in chunks internally.
Could you post some benchmark results comparing times of Parquet.jl master to the last release?
> Could you post some benchmark results comparing times of Parquet.jl master to the last release?
I mean to say, it's dead slow on both master and the latest release.
I was able to trace some issues; PR here: https://github.com/JuliaIO/Parquet.jl/pull/77. But more can be done.
The issue seems to be much deeper, in the reader design. With the fork below I can read the file in 3~4 seconds, but with Parquet.jl #77 it takes 130s.
See the fork here https://github.com/xiaodaigh/Parquet.jl/blob/master/README.md
I tried the fork, following the readme pointed to above. Not sure if I got it right, but read_parquet seems to return a single row in 13 sec.
```julia
julia> using Parquet

julia> @time df = read_parquet("dsd50p.parquet");
Progress: 100%|████████████████████████████████████████| Time: 0:00:07
 13.827823 seconds (60.12 M allocations: 2.331 GiB, 4.70% gc time)

julia> isa(df, NamedTuple)
true
```
but read_parquet seems to return a single row in 13 sec.
Try this:

```julia
df.tile_count
```

and you will see that it returns the whole column, not just one row. Each name is attached to a vector, not a single value.
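For illustration (made-up values, not the actual file's contents), the returned shape is a NamedTuple of column vectors:

```julia
# Illustrative only: each field of the NamedTuple is a whole column.
df = (tile_count = [3, 1, 4], name = ["a", "b", "c"])

df.tile_count         # the full column: [3, 1, 4]
first(df.tile_count)  # a single row's value for that column: 3
```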
As to speed, it's more likely due to PC differences? I consistently get 3s on my 6-core machine; I also have 2 SSDs in RAID 0.
Update
BTW, 3s should be considered really slow, as R's arrow can read it and summarise it in about ~0.13s. So lots of room to improve.
Okay, I see the column now. But it looks like you are disregarding the nested schema? So this will work only for non-nested data?
And you seem to be using the same underlying page and column-chunk read methods of Parquet.jl, right? So the current performance issues are in the layer above that.
> So this will work only for non-nested data?
That is correct. And that is the issue bothering me: I don't understand the nestedness, so I actually don't know how to deal with it.
Hopefully, there are things in my approach that can be reused to handle nested schemas?
Honestly, the reason I am not motivated to do the nested schema thing is that I have never needed it or used it.
Would it be acceptable to have a set of functions just to read non-nested data?
> So the current performance issues are in the layer above that.
Yes, I think so. Although there's definitely much room to improve in the values method, e.g. it uses push!, which causes much more allocation than necessary for unnested column reads (not sure about nested columns, so I'll keep my comment to unnested ones).
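A minimal sketch of the allocation difference, in generic Julia (this is not the actual values code); since the number of rows in a column chunk is known from the file metadata, the output can be allocated once up front:

```julia
# Generic sketch, not the actual `values` implementation.
function collect_push(n)
    out = Int64[]
    for i in 1:n
        push!(out, i)              # vector regrows repeatedly as it fills
    end
    return out
end

function collect_prealloc(n)
    out = Vector{Int64}(undef, n)  # one allocation; n comes from metadata
    for i in 1:n
        @inbounds out[i] = i
    end
    return out
end
```

Even just calling sizehint!(out, n) before the push! loop would avoid most of the regrowth.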
> Would it be acceptable to have a set of functions just to read non-nested data?
I guess. But I would try to figure out why we cannot handle nestedness and be performant as well.
> why we cannot handle nestedness and be performant as well
I don't see a reason why not. The issue is that I don't have the knowledge to contribute performant code for nested data. Also, I have no motivation there, since I never use nested columns.
So I only see 3 options:

1. we try a PR with just unnested columns (which I think is the majority of use cases out there)
2. someone other than me writes performant code to handle nested and unnested columns
3. do nothing
I would prefer 1 over 3, but I can only contribute to 1 to prevent 3, not to 2. If 2 can happen, it would be great! But if 2 takes a year, then I would rather see 1 than 3 as the eventuality. I can commit to doing 1.
> I don't understand the nestedness, so I actually don't know how to deal with it.
This is basically what the definition levels and repetition levels allow. It is based on Dremel; there should be some material out there if you search for it.
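A toy example of the encoding (generic Dremel semantics, not Parquet.jl's API; the vectors are illustrative):

```julia
# Schema: message doc { repeated group a { optional int64 b } }
# For the path a.b: max repetition level = 1, max definition level = 2.
#
# Record 1: a = [(b = 1), (b = 2)]
# Record 2: a = []   (the repeated group is empty)
#
# The column for a.b flattens to three entries:
vals       = [1, 2, missing]
def_levels = [2, 2, 0]  # 2 = b present; 0 = nothing on the path is defined
rep_levels = [0, 1, 0]  # 0 = starts a new record; 1 = repeats within `a`
# Reassembly walks the level vectors to rebuild the nesting per record.
```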
> there should be some material out there if you search for it
Yeah, I am not motivated, that's the issue, since I have never needed it. If you are too busy or also not motivated, do you think we can ask NumFOCUS for funding? I would be happy to read the Dremel paper and implement it for some monetary compensation.
If we get funding, do you want to split it? Happy for it to be a solo effort on either side too. Happy either way.
> we try a PR with just unnested columns (which I think is the majority of use cases out there)
I am open to this. Maybe we should keep it in a sub-module to keep the interface clean. The sub-module can implement the simpler non-nested read and write using the other lower-level methods.
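Perhaps something like this layout (a sketch only; `Simple`, the stub ParFile, and the read_parquet body are placeholders, not the eventual API):

```julia
# Layout sketch only; names and bodies are placeholders.
module Parquet

struct ParFile              # stands in for the existing low-level reader
    path::String
end

# ... existing cursor-based reader, page and column-chunk methods ...

module Simple
    import ..ParFile        # reuse the parent module's low-level machinery
    export read_parquet

    "Read a non-nested file into a NamedTuple of column vectors."
    function read_parquet(path::AbstractString)
        p = ParFile(path)
        # decode each column chunk directly, bypassing the row cursor
        return NamedTuple() # body elided in this sketch
    end
end

end # module Parquet
```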
> I am open to this. Maybe we should keep it in a sub-module to keep the interface clean.
OK. It will be quite a release, with the writer and column-based read. Or do you want to finish off the writer PR first?
Update
See WIP PR #79
> Maybe we should keep it in a sub-module
Any ideas for the sub-module name? UnnestedColumnReader? It might be painful when the reader is fixed to deal with nested columns too.
Just wanted to chime in real quick to say that it'd be great if we could improve the read performance! The difference between the Julia Parquet reader and fastparquet (via PyCall) was like night and day for me. But I totally understand if there's no funding etc.