kedro-plugins
kedro-plugins copied to clipboard
feat(datasets): Add xarray and GeoTiffDataset
Description
- Added GeoTiffDataset to read and write geotiff files with rioxarray and work with xarray in kedro. This is also related to issue #165
Development notes
Added kedro_datasets/xarray/geotiff_dataset.py and a very basic framework for testing in /kedro-datasets/tests/xarray/test_geotiff_dataset.py using a sample file from rioxarray. I could not run the test suit locally or on github due to the complex environment. So more tests are certainly needed. I think it would also be easy to make a preview method.
Checklist
- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [x] Updated the documentation to reflect the code changes
- [x] Added a description of this change in the relevant
RELEASE.mdfile - [x] Added tests to cover my changes
Thanks @tgoelles for this pull request 🚀
I could not run the test suit locally or on github due to the complex environment.
What instructions did you follow? Would love to understand how can we make the process easier. It's definitely a challenge to install all the dependencies of all datasets, but we could try to make running the tests of only 1 dataset easier.
Thanks @tgoelles for this pull request 🚀
I could not run the test suit locally or on github due to the complex environment.
What instructions did you follow? Would love to understand how can we make the process easier. It's definitely a challenge to install all the dependencies of all datasets, but we could try to make running the tests of only 1 dataset easier.
I did not follow any instruction. Is there one? Yes It would be great to just test one dataset and not on windows at the moment with the github actions. VS code debugging does not work for some reason: discovery error. Also if I want to run it in the terminal I get this:
"no tests ran" even when I runnning yaml for example
A basic test suite now works. There I need help, but the basic functionality works and I guess the rest can be added by someone who knows kedro better
I did not follow any instruction. Is there one? Yes It would be great to just test one dataset and not on windows at the moment with the github actions. VS code debugging does not work for some reason: discovery error. Also if I want to run it in the terminal I get this:
After you commented this I realized that this plugin didn't have a dedicated CONTRIBUTING.md (https://github.com/kedro-org/kedro-plugins/issues/364) and that's fixed now: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/CONTRIBUTING.md#unit-tests-100-coverage-pytest-pytest-cov
This however runs all the tests. You can try pytest -k xarray to only run those related to xarray. Let me know if that helps.
Just one broad question: are there any parts of this dataset that are GeoTIFF-specific, and that prevent it to work on generic n-dimensional labeled arrays from xarray?
It uses rioxarray to handle GeoTiffs: https://corteva.github.io/rioxarray/stable/getting_started/getting_started.html
Hello @tgoelles! We'd love to get this PR in. How can we help?
Hello @tgoelles! We'd love to get this PR in. How can we help?
I am currently super busy with other things. But I think its ready and someone needs to see that all the checks are Ok. I might have time in 2 weeks again for this.
I've resolved the merge conflicts for this PR and done some small cleanups, but there's still more things to do:
- The
doctestis failing for the example provided inGeoTiffDataset. This is the python example inside the docstring. - Test coverage is not at 100% for
GeoTiffDataset:
Name Stmts Miss Cover Missing
------------------------------------------------------------------------------------------
kedro_datasets/xarray/geotiff_dataset.py 42 6 86% 100, 103, 117-118, 132-133
@tgoelles Can you finish this or do you want someone to take over?
cc @riley-brady FYI :)
I've been doing some small bits fixing the doc string and added some missing tests, but while doing this I also realised I'm probably not the right person to finish the implementation. I'm not familiar at all with GeoTiff and its purpose. I noticed the dataset as it stands now doesn't have the credentials argument and is maybe missing some other bits when I compare it to e.g. NetCDFDataset that was merged yesterday.
Either someone with more knowledge about this data type should finish this PR or we can take this as our first "experimental" contribution (see #517) and get it in that way. WDYT @astrojuanlu?
Hello! just a note: I think the name GeoTiffDataset is a bit misleading because rioxarray wraps rastertio.open() which can open any kind of raster dataset. It can open any format supported by gdal raster drivers. So maybe something in the line of RasterDataset would be more strict and avoid possible confusions.
Hi @tgoelles, I'm aware that this PR has been open for quite a while and has had a lot of comments. I'm not sure the Kedro team can help at this point to get the dataset to a full ready state. However, we're about to launch our new experimental dataset contribution model, which basically means you can contribute datasets that are more experimental and don't have to have full test coverage etc here https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental.
I think the GeoTiffDataset might be a good candidate to get contributed in that way. What do you think?
I think the
GeoTiffDatasetmight be a good candidate to get contributed in that way. What do you think?
Yes I think that would be a possible way. Also as @BielStela pointed out it is for geospatial raster data. So there is also an overlap with the already now existing netcdf format.
I could be like this. RasterDataset => rioxarray. Then we could also think about enforcing that a coordinate reference system (CRS) is used. Or at least make an option for this. Also this needs to work well together with other geospatial datasets like netcdf and geopandas.
Yes this needs more experimenting, but it would introduce a huge list of new file formats which can be used.
Will you move it to the experimental folder? I could then rework it with the new concept of dealing with raster data in general and not just GeoTiff. I also gained a lot of experience working with these kind of data recently.
I'm happy to give a hand with this if needed @tgoelles :)
@tgoelles I've moved it! There's some issues with the docs, which I'll continue to fix, but you should be able to work on the code 🙂
Hello! just a note: I think the name
GeoTiffDatasetis a bit misleading because rioxarray wrapsrastertio.open()which can open any kind of raster dataset. It can open any format supported by gdal raster drivers. So maybe something in the line ofRasterDatasetwould be more strict and avoid possible confusions.
GeoRasterDataset? 😇
(I'm looking forward to seeing this merged by the way!)
@tgoelles I've moved it! There's some issues with the docs, which I'll continue to fix, but you should be able to work on the code 🙂
All docs issues are now resolved as well.
Rendered version: https://kedro--355.org.readthedocs.build/projects/kedro-datasets/en/355/api/kedro_datasets_experimental.xarray.GeoTiffDataset.html
I renamed everything to get startet. I think rioxarray.RasterDataset makes sense. This indicates what library is actually used and that it is raster data. GeoRasterDataset would be too much IMO because rioxarray implies that it is geospatial data and the geoscience community knows what raster data is.
Now the tests of the experimental datasets do not run in the github action. Also for the tests we need the same conftest.py as from the "normal" datasets. I guess you need a strategy for that. Locally the existing tests run on my macbook if I make a symlink to the conftest.py.
Maybe have the experimental dataset tests in the same tests folder as the other ones?
Now the tests of the experimental datasets do not run in the github action.
Yes this is on purpose, because we don't require experimental datasets to have tests to make it easier to contribute. Additionally, ensuring all tests pass and unblocking builds is quite a heavy burden for the maintenance team. For the "regular" datasets like CSVDataset debugging isn't too hard, but more niche datasets like this one can be hard for us to fix when broken. If a dataset gets a lot of traction and the Kedro TSC feels like we can start maintaining it fully, tests will become mandatory.
@tgoelles Feel free to address the conflicts of the release notes, signoff any remaining commits for the DCO check to pass, and we'll be happy to merge soon :)
I decided to name it now GeotiffDataset again. This way it would be possible to have different datasets in the future based on rioxarray similar to pandas. It can only read and write geotiffs currently, because it would have been a nightmare to support everything gdal can read and write. Also I enforced the use of a coordinate reference system.
The versioning and S3 stuff might be quite straight forward to implement now. I need to test it a bit more on my current project and might implement that soonish.
I decided to name it now GeotiffDataset again. This way it would be possible to have different datasets in the future based on rioxarray similar to pandas. It can only read and write geotiffs currently, because it would have been a nightmare to support everything gdal can read and write. Also I enforced the use of a coordinate reference system.
Fair enough!
The versioning and S3 stuff might be quite straight forward to implement now. I need to test it a bit more on my current project and might implement that soonish.
Do you want to do it in this PR, or should we review what's in here already?
Do you want to do it in this PR, or should we review what's in here already?
Please review what is there. Then I want to test it a bit further in real world code and then make a new PR and it could then be in the official datasets maybe.
Liberally re-requesting a review from @RobertoPalomares and @riley-brady if you have the time 🙏🏼
I don't know why this pandas check failed. it is the same code as in the main branch. I tried to solve the DCO, following the instructions of it with git rebase HEAD~228 --signoff, but this than leads to a huge amount of conflicts. there must be an easier way out of this. Also I am afraid to make mistakes there.
I tried to solve the DCO, following the instructions of it with git rebase HEAD~228 --signoff, but this than leads to a huge amount of conflicts. there must be an easier way out of this. Also I am afraid to make mistakes there.
You're not the first one...
I see maintainers are allowed to push to this branch, so I'll try to get you some help from the team.
CI failure is unrelated https://github.com/kedro-org/kedro-plugins/issues/740#issuecomment-2207612753
This is merged :shipit: thanks again everyone for your patience!