kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset

Open harm-matthias-harms opened this issue 1 year ago • 12 comments

Description

Geopandas supports reading parquet and feather files since version 0.8.0. These offer performance improvements over classical file types, such as .geojson or .shp.zip (supported by geopandas.GeoJSONDataset). We have used a private implementation for some time, but want to contribute them to kedro-datasets. This PR closes https://github.com/kedro-org/kedro-plugins/issues/196.

Development notes

Extended the implementation of geopandas.GeoJSONDataset with an optional file_formatparameter which is needed to use custom read_* and to_* methods. Replaced the geopandas.GeoJSONDataset with geopandas.GenericDataset.

Checklist

  • [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [x] Updated the documentation to reflect the code changes
  • [x] Added a description of this change in the relevant RELEASE.md file
  • [x] Added tests to cover my changes

harm-matthias-harms avatar Aug 21 '24 13:08 harm-matthias-harms

If the implementation is okay, I can add the feather dataset in the same PR or open a new one.

harm-matthias-harms avatar Aug 21 '24 13:08 harm-matthias-harms

Thanks for this contribution @harm-matthias-harms ! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)

To be honest, that would be my preferred solution as well. The current approach feels like a lot of duplicated boilerplate. I really like the Polars dataset and will try to modify the GeoJSON dataset tomorrow to make it work more generally.

harm-matthias-harms avatar Aug 21 '24 15:08 harm-matthias-harms

My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.

@astrojuanlu I see two options:

  1. GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.
  2. We create a GenericDataset (this fixes the naming and offers the same functionality and more ) and deprecate the GeoJSONDataset to remove it in some future major versions. Then we don't have to maintain the same dataset twice in the future.

Please tell me how you would like to proceed here.

harm-matthias-harms avatar Aug 28 '24 08:08 harm-matthias-harms

GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.

Hah, true... I see this wasn't discussed in https://github.com/kedro-org/kedro/pull/190 originally.

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

astrojuanlu avatar Aug 28 '24 08:08 astrojuanlu

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

We can just remove the old one and add a new one with a generic name in a breaking release. I don't think it's worth the effort of doing a non-breaking release with a deprecation warning first and then remove the old one in the next breaking release just for this one dataset.

We could do a TSC vote if we feel the nature of the dataset changes too much.

merelcht avatar Sep 03 '24 13:09 merelcht

Let's proceed that way then 👍🏼 @harm-matthias-harms are you willing to update the PR accordingly? Namely:

just remove the old one and add a new one with a generic name in a breaking release.

astrojuanlu avatar Sep 04 '24 12:09 astrojuanlu

@astrojuanlu I updated the PR. Since I merged main the tests have been failing. It looks like a fsspec problem. I don't know much about ffspec, maybe you have a good idea of how to fix this.

Last successful run: https://github.com/kedro-org/kedro-plugins/actions/runs/10593320929/job/29354563096?pr=812 First run that failed: https://github.com/kedro-org/kedro-plugins/actions/runs/10719091250/job/29722424013?pr=812

harm-matthias-harms avatar Sep 05 '24 12:09 harm-matthias-harms

I do not know.

Does geopandas handle remote filepaths? Like, if we do geopandas.read_{self._file_format}("s3://...") directly without

https://github.com/harm-matthias-harms/kedro-plugins/blob/9e88c5ee0284865b1256a5667bd6f7b691ddb3f7/kedro-datasets/kedro_datasets/geopandas/generic_dataset.py#L155-L156

will it work?

astrojuanlu avatar Sep 11 '24 14:09 astrojuanlu

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

harm-matthias-harms avatar Sep 11 '24 17:09 harm-matthias-harms

Thanks for your patience! Let us know how it goes

astrojuanlu avatar Sep 11 '24 18:09 astrojuanlu

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

It has to do with the docstring part and seems to affect python 3.10 specifically. I'm able to replicate the error locally, but haven't found what's wrong yet.

merelcht avatar Sep 12 '24 16:09 merelcht

@astrojuanlu I figured out the problem... I also bumped geopandas to v1. They use pyogrio as the new default over fiona. pyogrio does not yet support an alternate file system like fsspec (https://github.com/geopandas/pyogrio/issues/430). There are workarounds using buffers, but the simplest way would be to change the used engine to fiona until pyogrio supports fsspec.

harm-matthias-harms avatar Sep 13 '24 10:09 harm-matthias-harms

@astrojuanlu, do you wanna add anything before we merge?

ElenaKhaustova avatar Oct 09 '24 19:10 ElenaKhaustova