Parquet Reboot
Our parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc., are all inconsequential relative to this performance bottleneck.
Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.
There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.
Personal recommendation
If I were to build this I would remove a lot. I'd remove engines and go Arrow-first for both parquet deserialization and data access. I think that we could make this a single-python-file effort that had reliably good performance and much easier maintenance.
We would lose some features:
- fastparquet
- index / divisions
- metadata file
- Probably we need to maintain a `filesystem=` keyword in case someone has an odd filesystem and needs to use fsspec, but this would be the exception I think
- Probably the loss of engines means that RAPIDS folks need to maintain their own read-parquet code paths outside of Dask. I think that this is ok; they've done that before with dask-cudf. Probably they just do it again.
Some things to gain
- There are various settings we should be aware of from https://github.com/apache/arrow/issues/38389 that might help improve performance
- Consider keeping RLE columns as categorical
- Preferring LZ4 compression over snappy (lz4 used to not be supported, but maybe is today, and is maybe faster?)
- Use arrow filesystems by default when available
Having a smaller environment because you don’t have to install arrow is the main benefit of fastparquet as far as I am aware. Pandas will require arrow as a dependency starting with the 3.0 release if there aren’t any unexpected developments, which would render the environment-size argument irrelevant.
There are likely other benefits to fastparquet, but mostly I just want one system rather than having to support many. While I like fastparquet, I like pyarrow marginally more.
+1
I think specializing on pyarrow will be the biggest benefit here; it allows us to cut complexity.
Additionally, I would like us to review closely how we want to deal with files and rowgroups and the collection of statistics.
I'm very strongly biased to not collect statistics of the entire dataset automatically (if ever) but I appreciate that there are use cases where we have to deal with parquet metadata. I would like us to revisit the existing logic and invest time in reproducers that support the features we are offering and allow us to run tests for them.
Re: single-file-effort. I believe this is something that we also have to review a little more closely. Especially with the introduction of FusedIO in dask-expr there are significant optimizations available when treating multiple files in a batch (pre-fetching IO, for instance). pyarrow implements these capabilities transparently to a certain degree.
I'm going to take some time this morning and prototype a very simple implementation. If it works I'll use it for TPC-H. We'll see how far I get. I give it a 50/50 chance that I give up because it's too complex.
https://github.com/dask-contrib/dask-expr/pull/373
Cool - Sorry, I saw dask-contrib/dask-expr#373 but I missed this issue.
I agree that we should simplify things in parquet land, and the best way to do this is to re-write things from scratch. The parquet-engine landscape is dramatically different than it was when the Engine system was first introduced - The PyArrow API and feature set has completely changed since then, and RAPIDS is pretty adaptable on the parquet front anyway.
If you want dd.read_parquet to support hive-partitioned datasets and produce reliably “good” partition sizes, you will end up with a lot of the same bells and whistles that exist now, but you will have the opportunity to organize things in a way that makes sense for the current PyArrow API. Also, now that we have dask-Expr in the mix, the partition-size issue can probably be re-cast as an optimization pass instead of read_parquet arguments.
Somewhat related thoughts: I realize that you are not personally concerned about dask-cudf/rapids - I think that is okay if we can find a way to be smart about “dispatching” Expr classes in Dask-Expr (e.g. some variation or better alternative to https://github.com/dask-contrib/dask-expr/pull/321). If downstream projects have some way to replace specific Expr definitions that don’t suit them, then we can probably avoid the kinds of tech debt that the Engine design has caused in the past. One possible outcome is that we use Dask-Expr to start stripping out unnecessary backend/compute dispatching logic in favor of a lower-maintenance coarse-grained dispatching approach.
Yeah, I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?
Mostly I want to avoid a situation where things get sufficiently complex that people stop maintaining them. This seems to be what happened with parquet. We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.
If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.
I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?
Yes, I completely agree that we should defer as much as possible to PyArrow. The PyArrow feature set is probably ready to take the reins.
We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.
Yes, it is a mess, but I don't think it is completely a question of time or effort. I'm pretty sure people have just been afraid of breaking the complicated PyArrow/FastParquet/Dask-cuDF/Dask-GeoPandas/etc API contract that was designed at a time when Dask needed to implement most of the ugly dataset parsing/planning logic. We were obviously trying to do something "good" by building a pluggable read_parquet architecture, but the original goal has proven impractical over time.
If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.
That's good to hear. I'm also pretty motivated to minimize maintenance burden.
Note: It may also be valuable to define simple Expr templates for custom IO Expressions (similar to the way from_map makes it relatively easy to implement custom IO solutions in dask.dataframe). It seems ideal to focus on a simple/performant API and to make it easy for power-users to implement their own custom logic when they need it.
As a heads-up, I'll likely be far more nervous / hesitant around reworking things into a complex hierarchy. You should probably expect a lot more friction and demands for design docs ahead of time for anything like this. I'm likely going to push to keep things very flat and simple.
Hmm. That was not a suggestion for any kind of hierarchy at all, so not sure I understand. Either way, I hear you.
I'd like to comment that if the current parquet code is going to be in "flux" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet (https://docs.dask.org/en/latest/dataframe-best-practices.html#use-parquet) should be (temporarily) removed until a more workable Parquet system is in place.
and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet should be (temporarily) removed
Thanks for raising this concern. I believe the API as-is is not in flux; it is usable and maintained. It is, however, difficult to maintain, and our development velocity is suffering from the complexity, so we're considering a rewrite.
We haven't made a final decision on how exactly to proceed with this, but we will make sure that the transition will be as smooth as possible for users. The one recommendation I can give right now is to steer clear of the features that are already deprecated and raising warnings. They will certainly cease to exist.
Even with all the flaws of the current implementation, parquet is still far superior to other storage formats for tabular data, and this advice in the documentation is still accurate. This issue should not scare you away from using parquet but rather encourage you to use it, because it will get even better down the road.
@fjetter I'm curious about plans here. As I think through docs and Dask messaging there's a bunch of stuff I want to say about Dask and Spark and performance and my guess is that we'll need to address this before we can say good things there with any strength. I know that this topic isn't top of mind for you right now, but I'd like to align on roughly when something like a performant parquet system might arise. No immediate timing pressure on this, but I'd certainly like to align on this, say, early in 2024.
In terms of prioritization, I consider getting dask-expr into main dask the most important thing right now. There is still a lot of uncertainty, so nobody is working on parquet right now. I hope that the dust settles towards the end of January such that we gain more confidence about dask-expr and can pick parquet up again. So, my best guess is end of Q1.
Just a note that I have started experimenting (https://gist.github.com/rjzamora/2c55d7e1c614b7df8f27dd92205a43fa) with better ways to handle options like blocksize in the "future" read_parquet implementation. Florian has made the point several times that we shouldn't ever need to parse all the parquet metadata/statistics up-front, and I completely agree with him.
My assumption is that Coiled wants to take ownership of the parquet reboot. That would be welcome on my end, but I'm also happy to help and engage wherever it would be useful.
To be clear, anyone is welcome to propose changes / reinventions of Parquet. We'll eventually take responsibility here assuming no one else does (which is our current assumption to be safe).
Probably we need to maintain a filesystem= keyword in case someone has an odd filesystem and needs to use fsspec, but this would be the exception I think
Just found out that arrow doesn't support azure blob storage, so that's a good argument to keep the filesystem keyword.