torchgeo
RasterDataset: add control over resampling algorithm
By default, `rasterio.merge.merge` uses "nearest" as its resampling algorithm. This is fast and works well for masks, where you don't want to interpolate between two categorical classes, but it produces resampling artifacts for floating-point images. This PR adds an attribute to control the resampling algorithm used. It defaults to "bilinear" for float dtypes (usually images) and "nearest" for int dtypes (usually masks).

I chose to base this on dtype instead of `is_image` because there are times when you want to interpolate floating-point pixelwise regression masks.
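The dtype-based default described above can be sketched in a few lines. This is an illustrative helper, not the actual torchgeo API; the function name is made up:

```python
import numpy as np

# Hypothetical helper mirroring the PR's idea: choose a default resampling
# algorithm from the array dtype rather than from is_image.
def default_resampling(dtype: np.dtype) -> str:
    # Continuous data (images, regression masks) -> bilinear;
    # categorical data (class masks) -> nearest.
    return "bilinear" if np.issubdtype(dtype, np.floating) else "nearest"

print(default_resampling(np.dtype(np.float32)))  # bilinear
print(default_resampling(np.dtype(np.uint8)))    # nearest
```

Note that a float regression mask would get "bilinear" here even though `is_image` is False, which is the behavior argued for above.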
@robmarkcole can you upload before and after plots from the example you showed me?
Closes #2012
Doesn't seem to have as big of an impact on I/O performance as I expected:
| | raw (random) | raw (grid) | preprocessed (random) | preprocessed (grid) |
|---|---|---|---|---|
| before | 17.350 | 10.984 | 9.9158 | 4.6496 |
| after | 18.706 | 12.371 | 9.5933 | 4.5972 |
This approach doesn't work well for datasets like L7 Irish or L8 Biome, where a single `RasterDataset` is used for both the image and the mask. I think the solution is to instead subclass `IntersectionDataset` like I did in #1972.
I think cubic interpolation is not such a good idea because the corresponding convolution kernel has negative weights, meaning that the output data range is not guaranteed.
This can really mess up any normalization or scaling applied to the data before the resampling step.
Here’s a minimal example of what I’m talking about:
```python
import numpy as np
import rasterio
from rasterio.enums import Resampling

# Write a random float32 raster with values in [0, 1).
data = np.random.rand(512, 512).astype(np.float32)
with rasterio.open(
    "temp.tif", mode="w", count=1, dtype=np.float32, width=512, height=512
) as dst:
    dst.write(data, indexes=1)

# Read it back at 2x resolution with cubic resampling.
with rasterio.open("temp.tif") as src:
    src_data = src.read(
        out_shape=(src.count, int(src.height * 2), int(src.width * 2)),
        resampling=Resampling.cubic,
    )

# This assertion fails: the cubic kernel has negative weights, so the
# output overshoots the input's [0, 1) range.
assert (
    src_data.min() >= 0 and src_data.max() < 1
), "The cubic kernel has negative weights!"
```
This can also cause issues with no-data values; there may be huge inaccuracies in the final product depending on where they land in relation to the kernel.
I think bilinear interpolation is fine for 99.9% of cases.
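Bilinear, by contrast, is always a convex combination of the neighboring input pixels, so it can never leave the input's value range. A dependency-free sketch makes this easy to check (the hand-rolled 2x upsampler below is only an illustration, not rasterio's implementation):

```python
import numpy as np

# Bilinear output is a weighted average (weights >= 0, summing to 1) of
# nearby input pixels, so min/max of the output are bounded by the input.
def bilinear_upsample_2x(a: np.ndarray) -> np.ndarray:
    # Insert row midpoints, then column midpoints (edge rows/cols repeated).
    rows = np.repeat(a, 2, axis=0).astype(float)
    rows[1:-1:2] = 0.5 * (a[:-1] + a[1:])
    out = np.repeat(rows, 2, axis=1)
    out[:, 1:-1:2] = 0.5 * (rows[:, :-1] + rows[:, 1:])
    return out

rng = np.random.default_rng(0)
data = rng.random((8, 8))
up = bilinear_upsample_2x(data)
# Unlike the cubic example above, this assertion always holds.
assert up.min() >= data.min() and up.max() <= data.max()
```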
This pair shows the original issue addressed in this PR.
This was created with a random sampler while checking for the issue; in general the results are improved, but there are still some streak artifacts:
You can play around with other enums and see which looks best: https://rasterio.readthedocs.io/en/stable/api/rasterio.enums.html#rasterio.enums.Resampling
I agree with @DimitrisMantas that we should pick a safe/simple/fast default.
I think this is more or less the one-liner for each method:
- Nearest/Mode: Good for masks; not suitable for continuous fields (i.e., images)
- Bilinear: Suitable for continuous data; works best with smooth-ish fields (e.g., DEMs & DSMs)
- Cubic/Cubic Spline/Lanczos: These are all the same as far as we are concerned; their kernels have negative weights, so the output range is a bit unpredictable.
- Average: It sounds a bit like bilinear, but it is probably implemented as a different kernel.
- Gauss: Produces a probably undesirable smoothing effect when upsampling images.
- Min/Max/etc.: These might be useful in certain regression tasks, but I can't think of a concrete application at the moment.
By the way, Lanczos is technically "the best" of the bunch...
Agree with ^. By default it should be nearest-neighbor resampling for masks and bilinear for everything else. If a user wants something specific, they can override it.