DaskGeoDataFrame parquet write error - Series object has no attribute total_bounds

Open 4andy opened this issue 1 year ago • 7 comments

Hi - I'm running into an error when trying to write a DaskGeoDataFrame to parquet. I'm following the basic pattern here (see also) but using a smaller sample of a point dataset. Everything runs as expected until I try to write out the packed file, at which point I encounter the error below.

ALL software version info

pyarrow=15.0.0, spatialpandas=0.4.10, pandas=2.1.1, dask=2024.2.0, python=3.9.16

df = df.pack_partitions(npartitions=df.npartitions, shuffle='disk')
df.to_parquet(save_path)

[two images attached: screenshots of the error]

4andy avatar Feb 23 '24 17:02 4andy

I was able to get a small file written without error but I still encounter the error with a large dataset.

I re-ran on a different system with pandas 2.2.1, and again with pandas 1.5.3, and encountered the error each time. Any ideas are appreciated. Here is a more complete stack trace: [image]

4andy avatar Feb 27 '24 18:02 4andy

If the DataFrame has only one partition, saving works fine. If it has more than one partition, this error is returned.

4andy avatar Feb 27 '24 19:02 4andy

I would guess that this was implemented with fastparquet, which has now been dropped by Dask. Can you try downgrading the Dask version to something like 2020 and see if that works with/without fastparquet?

hoxbro avatar Feb 28 '24 06:02 hoxbro

Thanks for that idea @Hoxbro. I downgraded dask to 2020 but it returns the same error.

So far, in looking into the issue, I have found that any call to df.geometry.total_bounds after df.pack_partitions() raises the error. However, you can call the total_bounds property any number of times before packing partitions and it returns correctly.

4andy avatar Feb 28 '24 13:02 4andy

Did you try to set the parquet backend to fastparquet?

hoxbro avatar Feb 28 '24 14:02 hoxbro

I did try fastparquet (same error). However, I don't think it's related to that or to saving directly. Something happens in pack_partitions that causes any future calls to the geometry.total_bounds property to fail. It fails at save time because to_parquet calls that property.

4andy avatar Feb 28 '24 15:02 4andy

I found a trigger condition for the error - it occurs when one or more longitudes are negative. I attached a simple notebook that reproduces the error. If you change the negative longitude to positive the error is resolved. Not sure where to look in the code to patch this. Thanks! sp_error_example.ipynb.txt
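
The attached notebook is not reproduced here. As a rough sketch of the kind of check described above (not the notebook's actual contents), the snippet below builds a tiny point dataset, optionally with one negative longitude, and compares total_bounds before and after pack_partitions. The helper names make_points and check_total_bounds are hypothetical, and the sketch assumes that PointArray accepts an (x, y) tuple of arrays and that dd.from_pandas on a spatialpandas GeoDataFrame yields a DaskGeoDataFrame in the versions listed above.

```python
def make_points(negative_lon=True):
    """Return (lons, lats) for four sample points (hypothetical helper).

    With negative_lon=True the first longitude is negative, which is the
    trigger condition reported in this thread.
    """
    lons = [10.0, 20.0, 30.0, 40.0]
    lats = [1.0, 2.0, 3.0, 4.0]
    if negative_lon:
        lons[0] = -10.0
    return lons, lats


def check_total_bounds(lons, lats, npartitions=2):
    """Compare geometry.total_bounds before and after pack_partitions.

    Returns (before, after), where `after` is the AttributeError instance
    if the call fails. Third-party imports are kept local so the module
    can be imported without spatialpandas or dask installed.
    """
    import numpy as np
    import dask.dataframe as dd
    from spatialpandas import GeoDataFrame
    from spatialpandas.geometry import PointArray

    # Assumption: PointArray can be built from an (x, y) tuple of arrays.
    geometry = PointArray((np.asarray(lons), np.asarray(lats)))
    gdf = GeoDataFrame({"geometry": geometry})

    # Assumption: this returns a DaskGeoDataFrame with > 1 partition.
    ddf = dd.from_pandas(gdf, npartitions=npartitions)
    before = ddf.geometry.total_bounds

    # Same call as in the original report.
    packed = ddf.pack_partitions(npartitions=ddf.npartitions, shuffle="disk")
    try:
        after = packed.geometry.total_bounds
    except AttributeError as exc:
        after = exc
    return before, after
```

Calling check_total_bounds(*make_points(negative_lon=True)) and then again with negative_lon=False should show whether the sign of the longitude alone changes the outcome in a given environment.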

4andy avatar Feb 28 '24 21:02 4andy