GeoDataFrame has no attribute 'h3' only when multiprocessing using JobLib
I'm using joblib to try to parallelize a polyfill operation on a large dataset of polygons.
This line runs perfectly fine:
gdf.h3.polyfill(11)
But when I split the gdf into chunks and run with joblib, I get an error:
def polyfill_parallel(i, gdf_chunk):
    gdf_chunk = gpd.GeoDataFrame(gdf_chunk)
    # perform polyfill on the chunk
    return gdf_chunk.h3.polyfill(11)

Parallel(n_jobs=-1, verbose=1)(delayed(polyfill_parallel)(i, gdf_chunk) for i, gdf_chunk in gdf_chunks)
I get the error:
AttributeError: 'GeoDataFrame' object has no attribute 'h3'
I tried the chunking approach because I was initially looping over the rows of the main gdf, which passed GeoSeries objects to the function. I thought that was the cause of the error, but apparently it wasn't.
Hi @hsbsid, the issue here is that the subprocesses spawned by joblib do not import h3pandas. joblib will import geopandas because the passed object is a GeoDataFrame, but it doesn't know about h3pandas, so the .h3 accessor is never registered in the worker processes.
The simplest fix is to import h3pandas within the function:
def polyfill_parallel_with_import(gdf_chunk):
    import h3pandas
    gdf_chunk = gpd.GeoDataFrame(gdf_chunk)
    # perform polyfill on the chunk
    return gdf_chunk.h3.polyfill(11)
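For completeness, here is a minimal sketch of driving that function with joblib, reusing polyfill_parallel_with_import from above and assuming gdf is the original GeoDataFrame. The chunk count, the slicing-based chunking, and the final pd.concat are illustrative choices on my part, not part of the original question:
import pandas as pd
from joblib import Parallel, delayed

# Split the GeoDataFrame into contiguous chunks (8 is an arbitrary choice)
n_chunks = 8
chunk_size = -(-len(gdf) // n_chunks)  # ceiling division
gdf_chunks = [gdf.iloc[i:i + chunk_size] for i in range(0, len(gdf), chunk_size)]

# Each worker imports h3pandas itself, so the .h3 accessor is available
results = Parallel(n_jobs=-1, verbose=1)(
    delayed(polyfill_parallel_with_import)(chunk) for chunk in gdf_chunks
)

# Stitch the per-chunk results back into a single GeoDataFrame
gdf_filled = pd.concat(results)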
Dask-GeoPandas example
Note that dask-geopandas might be more convenient to work with than joblib, as it chunks the dataframe for you and provides a great API for working with distributed dataframes (Dask DataFrames).
I'm providing a full working example for future reference.
import h3pandas
import pandas as pd
import dask_geopandas

# Download and subset data
df = pd.read_csv('https://github.com/uber-web/kepler.gl-data/raw/master/nyctrips/data.csv')
df = df.rename({'pickup_longitude': 'lng', 'pickup_latitude': 'lat'}, axis=1)[['lng', 'lat', 'passenger_count']]
qt = 0.1
df = df.loc[(df['lng'] > df['lng'].quantile(qt)) & (df['lng'] < df['lng'].quantile(1 - qt))
            & (df['lat'] > df['lat'].quantile(qt)) & (df['lat'] < df['lat'].quantile(1 - qt))]
gdf = df.h3.geo_to_h3(9).h3.h3_to_geo_boundary()

def polyfill_parallel(gdf_chunk):
    import h3pandas
    return gdf_chunk.h3.polyfill(11)

dgdf = dask_geopandas.from_geopandas(gdf, npartitions=2)

# Compute a sample of the output to help Dask understand its structure
meta = polyfill_parallel(gdf.iloc[[0]])

dgdf.map_partitions(polyfill_parallel, meta=meta).compute()
Notice that we still need to import h3pandas inside the function. The import is quite quick though, so it shouldn't slow down the worker processes too much.
Closing as this was likely answered. Feel free to reopen if you have further questions!