
Using dtype with dask-geopandas

Open Titou325 opened this issue 3 years ago • 2 comments

Hi!

I am trying to wrap my head around using dtype along with dask-geopandas. dask.dataframe supports providing dtype by forwarding it to the underlying pandas call (see the dask.dataframe.read_csv documentation). This lets us reduce memory usage and avoid errors that cannot otherwise be worked around, especially with datetimes that fall outside pandas' supported date range (far in the future or past). However, this does not seem to be supported by dask-geopandas, nor is it available as a direct option in pyogrio.
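To illustrate the behaviour I mean, here is a minimal sketch using plain pandas `read_csv`, which is the call that dask forwards `dtype` to. The column names and values are made up for the example, and nothing here touches dask-geopandas or pyogrio.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; the "built" column holds dates that fall
# outside the range pandas can store at nanosecond resolution.
csv_text = "id,height,built\n1,12.5,3025-06-01\n2,8.0,1200-01-01\n"

# Explicit dtypes: smaller numeric types, and the date column kept as
# plain strings so that no datetime parsing is attempted.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "height": "float32", "built": str},
)

print(df.dtypes)
```

The `built` values stay as the original strings, so they load without any out-of-bounds error.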

It seems to me that support could be added by letting pyogrio accept additional arguments and pass them to the pandas.DataFrame constructor; dask-geopandas could then forward those arguments in its calls to pyogrio.

Is there anything I am missing that would prevent this change from being made?

Thanks a lot and have a nice day,

Titou325 avatar Aug 02 '22 16:08 Titou325

cc @brendan-ward

martinfleis avatar Aug 27 '22 18:08 martinfleis

Pyogrio dtypes are based on the best match to the underlying concrete data type of a specific dataset. CSV files don't have embedded dtypes, so there is certainly a benefit to providing those to the reader. However, for Pyogrio, I don't think that makes sense: overriding the dtypes means casting the data (e.g., float64 to float32), and that should happen outside Pyogrio, since it really isn't something that should happen in the direct read using GDAL.
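A minimal sketch of what casting outside the reader could look like, using a plain pandas DataFrame as a stand-in for the result of a read. The column names and target dtypes are made up for illustration; the same `astype` call is expected to work on a GeoDataFrame's non-geometry columns, but that is not shown here.

```python
import numpy as np
import pandas as pd

# Stand-in for a frame returned by a reader, with default 64-bit types.
df = pd.DataFrame(
    {
        "height": np.array([12.5, 8.0, 3.25], dtype="float64"),
        "floors": np.array([3, 2, 1], dtype="int64"),
    }
)

# Cast after the read; the reader itself is left untouched.
small = df.astype({"height": "float32", "floors": "int16"})

print(df.memory_usage(index=False).sum())     # bytes used before casting
print(small.memory_usage(index=False).sum())  # bytes used after casting
```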

As for date types outside their value range, there are some issues in pyogrio that are outside our control when we create pandas DataFrames from the underlying numpy arrays. Do you provide a different dtype to avoid the pandas issue with out-of-range values?

brendan-ward avatar Aug 29 '22 20:08 brendan-ward

Hello @brendan-ward, sorry for the late response,

Yes, for these dates we pass a string/object dtype to pandas to prevent the type inference that results in out-of-bounds dates. We do not need to do any actual processing of the dates, but the automatic inference prevents us from loading them as strings.

We therefore use a patched method that passes the dtypes down and skips the inference.
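To show the general idea only (this is not our patched function, and it does not call pyogrio), here is a rough sketch assuming the reader hands back a numpy array of millisecond-resolution datetimes. The values and the column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values as a reader might return them: numpy datetimes
# at millisecond resolution, well outside the nanosecond range.
raw = np.array(["3025-06-01", "1200-01-01"], dtype="datetime64[ms]")

# Convert to ISO date strings before building the DataFrame, so pandas
# never tries to coerce them to nanosecond datetimes.
as_text = np.datetime_as_string(raw, unit="D").astype(object)

df = pd.DataFrame({"built": as_text})
print(df)
```

The column ends up holding plain strings, which is all we need since we never do date arithmetic on it.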

I can provide you with a snippet of our patched function if that can help.

Have a nice day,

Titou325 avatar Nov 03 '22 18:11 Titou325

@Titou325 yes, please share the snippet. Would you mind doing so in a new issue in pyogrio instead of here?

brendan-ward avatar Nov 03 '22 18:11 brendan-ward

Opened as geopandas/pyogrio#174

Titou325 avatar Nov 04 '22 07:11 Titou325