Patrick Hoefler comments

Results: 345 comments of Patrick Hoefler

Merge dask and distributed repos?

> However it's interesting that this PR would likely have resulted in the same problem if they were in the same repo because we would still be using path rules...

Out of memory

Can you say a bit more about the size of your parquet files, worker specs, ... dask-expr fuses multiple parquet files to a single partition until we reach 75MB in...

Out of memory

No, pprint is not optimising your query. Can you try setting ``` dask.config.set({"dataframe.parquet.minimum-partition-size": 1}) ```? That will disable the fusion of partitions in read_parquet.
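To illustrate why a minimum partition size of 1 turns fusion off, here is a minimal pure-Python sketch of greedy size-based fusion. It is not dask-expr's actual implementation: the function name `fuse_by_min_size`, the file sizes, and the 75,000,000-byte threshold are all assumptions chosen for illustration.

```python
def fuse_by_min_size(file_sizes, minimum_partition_size):
    """Greedily group consecutive files until each group reaches the minimum size.

    Hypothetical helper for illustration only; returns lists of file indices.
    """
    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)
        current_size += size
        # Close the group as soon as it is at least as large as the threshold.
        if current_size >= minimum_partition_size:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files that never reached the threshold form a last group.
        partitions.append(current)
    return partitions


# Assumed file sizes in bytes (made up for this example).
sizes = [20_000_000, 30_000_000, 40_000_000, 10_000_000]

fused = fuse_by_min_size(sizes, 75_000_000)  # files are grouped together
unfused = fuse_by_min_size(sizes, 1)         # every file closes its own group
```

With a threshold of 1, any non-empty file satisfies the size check immediately, so each file stays in its own partition.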

Out of memory

This is a little bit puzzling to me tbh. > Dask tries to perform value counting using a single chunk? So, it might lead to OOM on a single node?...

Out of memory

I have a PR here https://github.com/dask/dask-expr/pull/1124 that will improve the value_counts case, but as @fjetter said, calling compute will still pull all the data into a single partition.
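The distinction between counting per chunk and collecting the final result can be sketched in plain Python. This is only a conceptual model using `collections.Counter`, not the dask-expr code path; the chunk contents and the helper names `count_chunk` and `combine_counts` are assumptions.

```python
from collections import Counter


def count_chunk(chunk):
    """Count values within one chunk; this step can run per partition."""
    return Counter(chunk)


def combine_counts(partial_counts):
    """Merge per-chunk counts into one result held in a single place."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Assumed example data: three chunks of a column.
chunks = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]

partials = [count_chunk(chunk) for chunk in chunks]  # independent per-chunk work
combined = combine_counts(partials)                  # final result lives in one object
```

The per-chunk step stays small, but the combined result has one entry per distinct value, so a column with very many distinct values still produces a large object once everything is gathered in one place.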

Sort dask array

That's currently not a priority for us (the hard version), but I'd be happy to review if you want to take a stab at it.

Sort dask array

Sorting and set_index are **not** lazy, this is correct. A plain shuffle is, but it doesn't guarantee order.

Sort dask array

The query optimizer made this look lazy to users, but it isn't actually lazy. The pre-compute is now triggered during optimization, i.e. if you run ``df.set_index(..).optimize()``. (you need more than a...

Sort dask array

No, we aren't squashing anything here. We are calculating intervals of the column that we are sorting by, i.e. x1 < x2.

Sort dask array

We have to figure out the intervals since we don't know what is in the column, i.e. x1 < x2.
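The idea of working out intervals before sorting can be sketched in plain Python. This is a simplified model, not Dask's implementation: the helpers `compute_divisions`, `assign_partition`, and `sort_partitions` are hypothetical, and the sketch scans all values where a real system would more likely sample.

```python
import bisect


def compute_divisions(values, npartitions):
    """Pick npartitions - 1 boundaries at evenly spaced ranks of the data."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // npartitions] for i in range(1, npartitions)]


def assign_partition(value, divisions):
    """Return the index of the interval that the value falls into."""
    return bisect.bisect_right(divisions, value)


def sort_partitions(partitions, npartitions):
    """Route every value to its interval, then sort each interval locally."""
    # The boundaries depend on the data, so this pass must run up front.
    all_values = [v for part in partitions for v in part]
    divisions = compute_divisions(all_values, npartitions)
    buckets = [[] for _ in range(npartitions)]
    for part in partitions:
        for value in part:
            buckets[assign_partition(value, divisions)].append(value)
    return divisions, [sorted(bucket) for bucket in buckets]


# Assumed input: three unsorted partitions.
unsorted_parts = [[5, 1, 9], [3, 7, 2], [8, 4, 6]]
divisions, sorted_parts = sort_partitions(unsorted_parts, 3)
flat = [v for part in sorted_parts for v in part]
```

Because the boundaries come from the column's values, they cannot be known until some data has been read, which is why this step requires a computation before the shuffle can be planned.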