
GeoDataFrame has no attribute 'h3' only when multiprocessing using JobLib

Open · hsbsid opened this issue on Jan 10, 2024 · 1 comment

I'm using JobLib to try and multiprocess a polyfill operation on a large dataset of polygons.

This line runs perfectly fine: gdf.h3.polyfill(11)

But when I split the gdf into chunks and run with joblib, I get an error:

def polyfill_parallel(i, gdf_chunk):
    gdf_chunk = gpd.GeoDataFrame(gdf_chunk)

    # perform polyfill on the chunk
    return gdf_chunk.h3.polyfill(11)

Parallel(n_jobs=-1, verbose=1)(delayed(polyfill_parallel)(i, gdf_chunk) for i, gdf_chunk in gdf_chunks)

I get the error: AttributeError: 'GeoDataFrame' object has no attribute 'h3'

I tried the chunking method because I was initially looping over the rows of the main gdf, which passed GeoSeries objects to the function. I thought that was the cause of the error, but it looks like it wasn't.
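(For reference, a minimal sketch of why the row-based version fails even in the main process: h3pandas registers its accessor on DataFrames, so a row, which is a Series, never exposes .h3. The single-polygon gdf here is illustrative, not the original data.)

import geopandas as gpd
import h3pandas  # registers the .h3 accessor, but only on DataFrames
from shapely.geometry import box

gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1)], crs='EPSG:4326')

for _, row in gdf.iterrows():
    print(type(row))        # a Series: has no .h3 accessor even with h3pandas imported
    # row.h3.polyfill(11)   # would raise AttributeError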

hsbsid · Jan 10 '24 19:01

Hi @hsbsid, the issue here is that the subprocesses spawned by joblib do not import h3pandas. The .h3 accessor only exists on a GeoDataFrame once h3pandas has been imported, because importing the library is what registers the accessor with pandas. Joblib will import geopandas in the workers, since the passed object is a GeoDataFrame, but it doesn't know about h3pandas.
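You can see this mechanism in a fresh interpreter that hasn't imported h3pandas yet (an illustrative sketch; the example GeoDataFrame is made up):

import geopandas as gpd
from shapely.geometry import box

gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1)])
print(hasattr(gdf, 'h3'))   # False: the accessor is not registered yet

import h3pandas             # importing h3pandas registers the .h3 accessor
print(hasattr(gdf, 'h3'))   # True: the same object now exposes .h3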

The simplest fix is to import h3pandas within the function:

def polyfill_parallel_with_import(gdf_chunk):
    import geopandas as gpd
    import h3pandas  # registers the .h3 accessor inside the worker process
    gdf_chunk = gpd.GeoDataFrame(gdf_chunk)

    # perform polyfill on the chunk
    return gdf_chunk.h3.polyfill(11)
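For completeness, the corresponding joblib call could look like this (a sketch; np.array_split and the chunk count of 8 are illustrative choices, not part of the original code):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# split the GeoDataFrame into roughly equal row-wise chunks
gdf_chunks = np.array_split(gdf, 8)

results = Parallel(n_jobs=-1, verbose=1)(
    delayed(polyfill_parallel_with_import)(chunk) for chunk in gdf_chunks
)
gdf_filled = pd.concat(results)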

Dask-GeoPandas example

Note that dask-geopandas might be more convenient to work with than joblib here, as it chunks the dataframe for you and offers a great API for working with distributed DataFrames (Dask DataFrames).

I'm providing a full working example for future reference.

import h3pandas
import pandas as pd
import dask_geopandas

# Download and subset data
df = pd.read_csv('https://github.com/uber-web/kepler.gl-data/raw/master/nyctrips/data.csv')
df = df.rename({'pickup_longitude': 'lng', 'pickup_latitude': 'lat'}, axis=1)[['lng', 'lat', 'passenger_count']]

# Keep only points within the central quantile range to drop outliers
qt = 0.1
df = df.loc[(df['lng'] > df['lng'].quantile(qt)) & (df['lng'] < df['lng'].quantile(1-qt))
            & (df['lat'] > df['lat'].quantile(qt)) & (df['lat'] < df['lat'].quantile(1-qt))]

# Index points at resolution 9, then turn each cell into its boundary polygon
gdf = df.h3.geo_to_h3(9).h3.h3_to_geo_boundary()

def polyfill_parallel(gdf_chunk):
    import h3pandas  # must be imported inside the worker
    return gdf_chunk.h3.polyfill(11)

dgdf = dask_geopandas.from_geopandas(gdf, npartitions=2)
# Compute a sample of the output to help Dask understand its structure
meta = polyfill_parallel(gdf.iloc[[0]])
dgdf.map_partitions(polyfill_parallel, meta=meta).compute()

Notice that we still need to import h3pandas inside the function. The library is quite quick to import though, so it shouldn't slow down the worker processes too much.

DahnJ · Feb 02 '24 22:02

Closing as this was likely answered. Feel free to reopen if you have further questions!

DahnJ · Mar 30 '24 17:03