Patrick Hoefler
This is definitely an anti-pattern if chunks is passed as well and something you should never do. That case seems sensible enough to warn about
Their instance type selection is our biggest foe here. Dask doesn't perform very well on these large instances. Using more, smaller instances with the same number of cores in aggregate...
So I ran this on Coiled and it's a lot faster with proper instances, but the main problem is that their Parquet files are not suited for distributed processing. They...
Any chance you have an older arrow version installed?
Ok, that's odd then... I ran this with the dataset that we are hosting in a Coiled S3 bucket, i.e. ``` dataset = "s3://coiled-datasets/uber-lyft-tlc/" ``` and that one finished the...
> Parquet performance will also depend on the backend used. The pyarrow backend is / should be faster, but it still has a lot of sharp edges and isn't the...
The reason is probably GIL contention; could you try creating an explicit cluster for Dask? i.e. ``` from distributed import Client def fill_holes(geometry, min_hole_size): """ Fill holes in a geometry...
The most likely culprit for this is GIL contention, which blocks the other threads so the computation effectively runs on a single core. You can solve that problem by creating a distributed cluster...
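A minimal sketch of the idea, with a made-up stand-in function: the comment's actual suggestion is a distributed cluster (`from distributed import Client; client = Client()`); the sketch below uses Dask's multiprocessing scheduler, which achieves the same effect of giving each worker process its own GIL so pure-Python work can run in parallel:

```python
import dask


@dask.delayed
def cpu_bound_stub(x):
    # Stand-in for a GIL-holding, CPU-bound function such as a
    # pure-Python geometry operation.
    return x * 2


tasks = [cpu_bound_stub(i) for i in range(4)]

# The default threaded scheduler would serialize this work behind the
# GIL; process-based workers (here via the multiprocessing scheduler,
# or equivalently a distributed Client) can run it in parallel.
results = dask.compute(*tasks, scheduler="processes")
```

For workloads that release the GIL (NumPy, pandas on large blocks), the threaded scheduler is usually fine; it is the pure-Python, GIL-holding code paths that benefit from process-based workers.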
Re your runtime (can't comment on what needs to be in a single partition): Could you try creating a cluster before you call compute? That should help with parallelising things,...
only if your result is a delayed object, so no for the dataframe case