kedro-plugins feat(datasets): Add xarray and GeoTiffDataset

trafficstars

Description

Added GeoTiffDataset to read and write geotiff files with rioxarray and work with xarray in kedro. This is also related to issue #165

Development notes

Added kedro_datasets/xarray/geotiff_dataset.py and a very basic framework for testing in /kedro-datasets/tests/xarray/test_geotiff_dataset.py using a sample file from rioxarray. I could not run the test suit locally or on github due to the complex environment. So more tests are certainly needed. I think it would also be easy to make a preview method.

Checklist

[x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[x] Updated the documentation to reflect the code changes
[x] Added a description of this change in the relevant RELEASE.md file
[x] Added tests to cover my changes

Sep 28 '23 12:09 tgoelles

Thanks @tgoelles for this pull request 🚀

I could not run the test suit locally or on github due to the complex environment.

What instructions did you follow? Would love to understand how can we make the process easier. It's definitely a challenge to install all the dependencies of all datasets, but we could try to make running the tests of only 1 dataset easier.

Sep 29 '23 09:09 astrojuanlu

Thanks @tgoelles for this pull request 🚀

I could not run the test suit locally or on github due to the complex environment.

What instructions did you follow? Would love to understand how can we make the process easier. It's definitely a challenge to install all the dependencies of all datasets, but we could try to make running the tests of only 1 dataset easier.

I did not follow any instruction. Is there one? Yes It would be great to just test one dataset and not on windows at the moment with the github actions. VS code debugging does not work for some reason: discovery error. Also if I want to run it in the terminal I get this:

"no tests ran" even when I runnning yaml for example

Oct 03 '23 04:10 tgoelles

A basic test suite now works. There I need help, but the basic functionality works and I guess the rest can be added by someone who knows kedro better

Oct 05 '23 11:10 tgoelles

I did not follow any instruction. Is there one? Yes It would be great to just test one dataset and not on windows at the moment with the github actions. VS code debugging does not work for some reason: discovery error. Also if I want to run it in the terminal I get this:

After you commented this I realized that this plugin didn't have a dedicated CONTRIBUTING.md (https://github.com/kedro-org/kedro-plugins/issues/364) and that's fixed now: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/CONTRIBUTING.md#unit-tests-100-coverage-pytest-pytest-cov

This however runs all the tests. You can try pytest -k xarray to only run those related to xarray. Let me know if that helps.

Oct 13 '23 12:10 astrojuanlu

Just one broad question: are there any parts of this dataset that are GeoTIFF-specific, and that prevent it to work on generic n-dimensional labeled arrays from xarray?

It uses rioxarray to handle GeoTiffs: https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html

Oct 13 '23 13:10 tgoelles

Hello @tgoelles! We'd love to get this PR in. How can we help?

Nov 10 '23 17:11 astrojuanlu

Hello @tgoelles! We'd love to get this PR in. How can we help?

I am currently super busy with other things. But I think its ready and someone needs to see that all the checks are Ok. I might have time in 2 weeks again for this.

Nov 21 '23 06:11 tgoelles

I've resolved the merge conflicts for this PR and done some small cleanups, but there's still more things to do:

The doctest is failing for the example provided in GeoTiffDataset. This is the python example inside the docstring.
Test coverage is not at 100% for GeoTiffDataset:

Name                                                         Stmts   Miss  Cover   Missing
------------------------------------------------------------------------------------------
kedro_datasets/xarray/geotiff_dataset.py     42      6    86%   100, 103, 117-118, 132-133

@tgoelles Can you finish this or do you want someone to take over?

Jan 04 '24 09:01 merelcht

cc @riley-brady FYI :)

Feb 28 '24 18:02 astrojuanlu

I've been doing some small bits fixing the doc string and added some missing tests, but while doing this I also realised I'm probably not the right person to finish the implementation. I'm not familiar at all with GeoTiff and its purpose. I noticed the dataset as it stands now doesn't have the credentials argument and is maybe missing some other bits when I compare it to e.g. NetCDFDataset that was merged yesterday.

Either someone with more knowledge about this data type should finish this PR or we can take this as our first "experimental" contribution (see #517) and get it in that way. WDYT @astrojuanlu?

Feb 29 '24 10:02 merelcht

Hello! just a note: I think the name GeoTiffDataset is a bit misleading because rioxarray wraps rastertio.open() which can open any kind of raster dataset. It can open any format supported by gdal raster drivers. So maybe something in the line of RasterDataset would be more strict and avoid possible confusions.

May 14 '24 10:05 BielStela

Hi @tgoelles, I'm aware that this PR has been open for quite a while and has had a lot of comments. I'm not sure the Kedro team can help at this point to get the dataset to a full ready state. However, we're about to launch our new experimental dataset contribution model, which basically means you can contribute datasets that are more experimental and don't have to have full test coverage etc here https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental.

I think the GeoTiffDataset might be a good candidate to get contributed in that way. What do you think?

May 15 '24 16:05 merelcht

I think the GeoTiffDataset might be a good candidate to get contributed in that way. What do you think?

Yes I think that would be a possible way. Also as @BielStela pointed out it is for geospatial raster data. So there is also an overlap with the already now existing netcdf format.

I could be like this. RasterDataset => rioxarray. Then we could also think about enforcing that a coordinate reference system (CRS) is used. Or at least make an option for this. Also this needs to work well together with other geospatial datasets like netcdf and geopandas.

Yes this needs more experimenting, but it would introduce a huge list of new file formats which can be used.

Will you move it to the experimental folder? I could then rework it with the new concept of dealing with raster data in general and not just GeoTiff. I also gained a lot of experience working with these kind of data recently.

May 16 '24 05:05 tgoelles

I'm happy to give a hand with this if needed @tgoelles :)

May 16 '24 15:05 BielStela

@tgoelles I've moved it! There's some issues with the docs, which I'll continue to fix, but you should be able to work on the code 🙂

May 17 '24 14:05 merelcht

Hello! just a note: I think the name GeoTiffDataset is a bit misleading because rioxarray wraps rastertio.open() which can open any kind of raster dataset. It can open any format supported by gdal raster drivers. So maybe something in the line of RasterDataset would be more strict and avoid possible confusions.

GeoRasterDataset? 😇

(I'm looking forward to seeing this merged by the way!)

May 21 '24 14:05 astrojuanlu

@tgoelles I've moved it! There's some issues with the docs, which I'll continue to fix, but you should be able to work on the code 🙂

All docs issues are now resolved as well.

May 23 '24 13:05 merelcht

Rendered version: https://kedro--355.org.readthedocs.build/projects/kedro-datasets/en/355/api/kedro_datasets_experimental.xarray.GeoTiffDataset.html

May 23 '24 13:05 astrojuanlu

I renamed everything to get startet. I think rioxarray.RasterDataset makes sense. This indicates what library is actually used and that it is raster data. GeoRasterDataset would be too much IMO because rioxarray implies that it is geospatial data and the geoscience community knows what raster data is.

Now the tests of the experimental datasets do not run in the github action. Also for the tests we need the same conftest.py as from the "normal" datasets. I guess you need a strategy for that. Locally the existing tests run on my macbook if I make a symlink to the conftest.py.

Maybe have the experimental dataset tests in the same tests folder as the other ones?

May 24 '24 12:05 tgoelles

Now the tests of the experimental datasets do not run in the github action.

Yes this is on purpose, because we don't require experimental datasets to have tests to make it easier to contribute. Additionally, ensuring all tests pass and unblocking builds is quite a heavy burden for the maintenance team. For the "regular" datasets like CSVDataset debugging isn't too hard, but more niche datasets like this one can be hard for us to fix when broken. If a dataset gets a lot of traction and the Kedro TSC feels like we can start maintaining it fully, tests will become mandatory.

May 31 '24 12:05 merelcht

@tgoelles Feel free to address the conflicts of the release notes, signoff any remaining commits for the DCO check to pass, and we'll be happy to merge soon :)

Jun 03 '24 10:06 astrojuanlu

I decided to name it now GeotiffDataset again. This way it would be possible to have different datasets in the future based on rioxarray similar to pandas. It can only read and write geotiffs currently, because it would have been a nightmare to support everything gdal can read and write. Also I enforced the use of a coordinate reference system.

The versioning and S3 stuff might be quite straight forward to implement now. I need to test it a bit more on my current project and might implement that soonish.

Jun 05 '24 07:06 tgoelles

I decided to name it now GeotiffDataset again. This way it would be possible to have different datasets in the future based on rioxarray similar to pandas. It can only read and write geotiffs currently, because it would have been a nightmare to support everything gdal can read and write. Also I enforced the use of a coordinate reference system.

Fair enough!

The versioning and S3 stuff might be quite straight forward to implement now. I need to test it a bit more on my current project and might implement that soonish.

Do you want to do it in this PR, or should we review what's in here already?

Jun 05 '24 08:06 astrojuanlu

Do you want to do it in this PR, or should we review what's in here already?

Please review what is there. Then I want to test it a bit further in real world code and then make a new PR and it could then be in the official datasets maybe.

Jun 05 '24 10:06 tgoelles

Liberally re-requesting a review from @RobertoPalomares and @riley-brady if you have the time 🙏🏼

Jun 05 '24 10:06 astrojuanlu

I don't know why this pandas check failed. it is the same code as in the main branch. I tried to solve the DCO, following the instructions of it with git rebase HEAD~228 --signoff, but this than leads to a huge amount of conflicts. there must be an easier way out of this. Also I am afraid to make mistakes there.

Jun 07 '24 05:06 tgoelles

I tried to solve the DCO, following the instructions of it with git rebase HEAD~228 --signoff, but this than leads to a huge amount of conflicts. there must be an easier way out of this. Also I am afraid to make mistakes there.

You're not the first one...

I see maintainers are allowed to push to this branch, so I'll try to get you some help from the team.

Jun 10 '24 07:06 astrojuanlu

CI failure is unrelated https://github.com/kedro-org/kedro-plugins/issues/740#issuecomment-2207612753

Jul 04 '24 06:07 astrojuanlu

This is merged :shipit: thanks again everyone for your patience!

Jul 05 '24 11:07 astrojuanlu

kedro-plugins kedro-plugins copied to clipboard

feat(datasets): Add xarray and GeoTiffDataset

Description

Development notes

Checklist

kedro-plugins
kedro-plugins copied to clipboard