Patrick Hoefler
This is definitely an anti-pattern if chunks is passed as well and something you should never do. That case seems sensible enough to warn about
Their instance type selection is our biggest foe here. Dask doesn't perform very well on these large instances. Using more, smaller instances with the same number of cores in aggregate...
So I ran this on Coiled and it's a lot faster with proper instances, but the main problem is that their Parquet files are not suited for distributed processing. They...
Any chance you have an older arrow version installed?
Ok, that's odd then... I ran this with the dataset that we are hosting in a Coiled S3 bucket, i.e. ``` dataset = "s3://coiled-datasets/uber-lyft-tlc/" ``` and that one finished the...
> Parquet performance will also depend on the backend used. The pyarrow backend is / should be faster, but it still has a lot of sharp edges and isn't the...
The reason is probably GIL contention; could you try creating an explicit cluster for Dask? i.e. ``` from distributed import Client def fill_holes(geometry, min_hole_size): """ Fill holes in a geometry...
The most likely culprit for this is GIL contention, which blocks the other threads so the computation effectively runs on a single core. You can solve that problem by creating a distributed cluster...
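A minimal sketch of the idea, with a made-up stand-in function: the comment's actual suggestion is a distributed cluster (`from distributed import Client; client = Client()`); the sketch below uses Dask's multiprocessing scheduler, which achieves the same effect of giving each worker process its own GIL so pure-Python work can run in parallel:

```python
import dask


@dask.delayed
def cpu_bound_stub(x):
    # Stand-in for a GIL-holding, CPU-bound function such as a
    # pure-Python geometry operation.
    return x * 2


tasks = [cpu_bound_stub(i) for i in range(4)]

# The default threaded scheduler would serialize this work behind the
# GIL; process-based workers (here via the multiprocessing scheduler,
# or equivalently a distributed Client) can run it in parallel.
results = dask.compute(*tasks, scheduler="processes")
```

For workloads that release the GIL (NumPy, pandas on large blocks), the threaded scheduler is usually fine; it is the pure-Python, GIL-holding code paths that benefit from process-based workers.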
Re your runtime (can't comment on what needs to be in a single partition): Could you try creating a cluster before you call compute? That should help with parallelising things,...
only if your result is a delayed object, so no for the dataframe case