Patrick Hoefler

Results 345 comments of Patrick Hoefler

> However it's interesting that this PR would likely have resulted in the same problem if they were in the same repo because we would still be using path rules...

Can you say a bit more about the size of your parquet files, worker specs, ... dask-expr fuses multiple parquet files to a single partition until we reach 75MB in...

No, pprint is not optimising your query Can you try setting ``` dask.config.set({"dataframe.parquet.minimum-partition-size": 1}) ``` that will disable the fusion of partitions in read_parquet

This is a little bit puzzling to me tbh > Dask tries to perform value counting using a single chunk? So, it might lead to OOM on a single node?...

I have a PR here https://github.com/dask/dask-expr/pull/1124 that will improve the value_counts case, but as @fjetter said calling compute will still pull all the data into a single partition.

That's currently not a priority for us (the hard version), but I'd be happy to review if you want to take a stab at it

sorting and set_index are **not** lazy, this is correct. plain shuffle is, but it doesn't guarantee order

The query optimizer made this kind of lazy for users, but not actually. The pre-compute is now triggered during optimization, i.e. if you run ``df.set_index(..).optimize()``. (you need more than a...

No, we aren't squashing anything here. We are calculating intervals of the column that we are sorting by, i.e. x1 < x2

We have to figure the intervals out since we don't know what is in the column, i.e. x1 < x2