kedro-plugins
kedro-plugins copied to clipboard
Mulitband rasters saved incorrectly in kedro_datasets_experimental.rioxarray.GeoTIFFDataset
Description
Hi, thanks for creating this dataset plugin for us geospatial folks.
The GeoTIFFDataset uses rioxarray to open and write datasets, except in the case of multi-band rasters. In this case, a custom _save_multiband function will use rasterio to write the data.
https://github.com/kedro-org/kedro-plugins/blob/159e0a3e45ac81e6465c6bb010492f33f7e98064/kedro-datasets/kedro_datasets_experimental/rioxarray/geotiff_dataset.py#L137-L138
There are three issues with this implementation, which result in the saved dataset to not matching the loaded dataset.
- The raster transform is calculated from the minimum and maximum lon/lat values, shrinking the extent by half a pixel on all sides.
- xarray attrs are not written to geotiff tags as expected.
- A nodata value is assigned when none exists.
In my pipelines, I've worked around the issue by writing a simple custom dataset that always uses the standard dataarray.rio.to_raster(), which appears to work as expected. However, given that effort was put into writing the _save_multiband in the first place, I feel like I'm missing something.
# custom dataset
class GeoTIFFDataset(geotiff_dataset.GeoTIFFDataset):
def _save(self, data: xarray.DataArray) -> None:
self._sanity_check(data)
save_path = get_filepath_str(self._get_save_path(), self._protocol)
data.rio.to_raster(save_path, **self._save_args)
self._fs.invalidate_cache(save_path)
Context
This bug affects processing all raster datasets with a band dimension, e.g. landsat/sentinel data
Steps to Reproduce
Load and save a geotiff dataset with bands
# node
def copy_geotiff(input_geotiff: xarray.DataArray) -> xarray.DataArray:
return input_geotiff
Expected Result
The output should be identical to the input
Thank you for your query. I am looping in the creator of this dataset @tgoelles to assist with the above.
Happy to see that geospatial folks are using it.
I had a reason to make a separate _save_multiband, but don't remember 🫣. I think something was missing in the rio.to_raster or it did not work back then.
Would be great if we could eliminate the need for _save_multiband.
I guess your issue 1 is the most critical one and a bit surprising. Maybe you can alaborate a bit on it.
I think number 2 can easily be implemented.
Number 3 is an opinionated move I guess, I added a default NAN value to be sure to have one. I had a horrible bug down the line related to pytorch and torchgeo which was hard to find because of it.
Thanks, for sure.
On why you might have written this. I see in the documentation for rioxarray, that xr.Dataset.rio.to_raster() only supports 2d arrays. Whereas xr.DataArray.rio.to_raster() doesn't have that restriction. The type annotations suggest you should always be getting DataArrays, but maybe somewhere along the way you were working with xr.Datasets?
For 1. Here, the data.x.min() will give you the center of the leftmost pixel, rather than the edge.
https://github.com/kedro-org/kedro-plugins/blob/159e0a3e45ac81e6465c6bb010492f33f7e98064/kedro-datasets/kedro_datasets_experimental/rioxarray/geotiff_dataset.py#L163-L170
You can compare:
In [19]: rio.transform.from_bounds(arr.x.min(), arr.y.min(), arr.x.max(), arr.y.max(), arr[0].shape[1], arr[0].shape[0])
Out[19]:
Affine(<xarray.DataArray 'x' ()> Size: 8B
array(9.9609375)
Coordinates:
spatial_ref int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(0.)
Coordinates:
spatial_ref int64 8B 0, <xarray.DataArray 'x' ()> Size: 8B
array(250162.63166085)
Coordinates:
spatial_ref int64 8B 0,
<xarray.DataArray 'x' ()> Size: 8B
array(0.)
Coordinates:
spatial_ref int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(-9.9609375)
Coordinates:
spatial_ref int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(4317235.02328553)
Coordinates:
spatial_ref int64 8B 0)
In [20]: arr.rio.transform()
Out[20]:
Affine(10.0, 0.0, 250157.63166085022,
0.0, -10.0, 4317240.023285527)
Hi, I also belong to the group of geo folks using kedro! I want to get involved with the project and taking a look around I found this issue.
I'd like to help with this since I already had to make a rasterio-based GeoTiffDataset for our company. I think tests could highlight all the issues you have been discussing and I'm willing to write them if it can be helpful. Do experimental datasets have a test folder? I can't find one.
Also, I think it could be useful to have a rasterio-numpy implementation of this class, to be used in light environments. Tests could be beneficial to confirm different implementations of the same class using different underlying libraries work in similar ways.
EDIT: I forked the repo and tried to draft a couple of tests, and indeed I could replicate the first bug mentioned (wrong transform):
AssertionError: assert Affine(0.9, 0....0, -0.9, 9.0) == Affine(1.0, 0....0, 1.0, -0.5)
Thanks for looking at this @gionnid !
Indeed I don't think there's a tests folder for experimental datasets, ~~@merelcht what do you think of creating it under kedro-datasets/tests/experimental~~ ?
The tests directory for the experimental datasets is here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests
Hey @fgassert, are you interested in taking this issue and contributing to experimental datasets?
This issue has been closed due to lack of information. Feel free to re-open this issue if you're facing a similar problem. Please provide as much information as possible so we can help resolve your issue.