
FLAIR#2 Dataset and Datamodule Integration

Open MathiasBaumgartinger opened this issue 1 year ago • 23 comments

FLAIR#2 dataset

The FLAIR #2 <https://github.com/IGNF/FLAIR-2> dataset is an extensive dataset from the French National Institute of Geographical and Forest Information (IGN) that provides a unique and rich resource for large-scale geospatial analysis. The dataset is sampled countrywide and is composed of over 20 billion annotated pixels of very high resolution aerial imagery at 0.2 m spatial resolution, acquired over three years and different months (spatio-temporal domains).

FLAIR #2 is a semantic segmentation dataset for aerial imagery. It contains aerial images, Sentinel-2 images, and masks with 13 classes, and is split into a training and a test set.

Dataset features:

* over 20 billion annotated pixels
* aerial imagery
    * 5x512x512
    * 0.2m spatial resolution
    * 5 channels (RGB-NIR-Elevation)
* Sentinel-2 imagery
    * 10-20m spatial resolution
    * 10 spectral bands
    * snow/cloud masks (with 0-100 probability)
    * multiple time steps (T)
    * Tx10xWxH, where T, W, and H are variable
* label (masks)
    * 512x512
    * 13 classes

Dataset classes:

0: "building",
1: "pervious surface",
2: "impervious surface",
3: "bare soil",
4: "water",
5: "coniferous",
6: "deciduous",
7: "brushwood",
8: "vineyard",
9: "herbaceous vegetation",
10: "agricultural land",
11: "plowed land",
12: "other"  

If you use this dataset in your research, please cite the following paper:

* https://doi.org/10.48550/arXiv.2310.13336

(example plot of a sample, referenced in the questions below)

Implementation Details

NonGeoDataset, __init__()

After discussions following https://github.com/microsoft/torchgeo/issues/2303, we decided that, at least until the faulty mask data are fixed, the FLAIR2 dataset will be of type NonGeoDataset. Unlike common NonGeoDatasets, FLAIR2 exposes use_toy and use_sentinel arguments. The use_toy flag switches to the toy data, a small subset of the full dataset. The use_sentinel argument decides whether a sample includes the augmented Sentinel data provided by the maintainers of FLAIR2. A rough sketch of the resulting constructor is shown below.
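For illustration, a minimal sketch of what the constructor could look like (the use_toy and use_sentinel flags are from the discussion above; all other argument names and defaults are assumptions, not the final API):

from collections.abc import Callable

from torchgeo.datasets import NonGeoDataset


class FLAIR2(NonGeoDataset):
    def __init__(
        self,
        root: str = 'data',
        split: str = 'train',
        use_toy: bool = False,  # use the small toy subset instead of the full data
        use_sentinel: bool = True,  # include the Sentinel-2 super-patch data in each sample
        transforms: Callable | None = None,
        download: bool = False,
        checksum: bool = False,
    ) -> None:
        ...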

_verify, _download, _extract

As each of the splits/sample types (i.e. [train, test] combined with [aerial, sentinel, labels]) is contained in an individual zip download, download and extraction have to happen multiple times. The toy dataset, on the other hand, is contained in a single zip. Furthermore, to map the super-patches of the Sentinel data to the actual input image, a flair-2_centroids_sp_to_patch.json is required, which equally has to be downloaded as an individual zip. A sketch of the resulting download loop follows below.
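A rough sketch of the download logic, assuming torchgeo's download_and_extract_archive helper; the archive names and checksums below are placeholders (except flair_2_centroids_sp_to_patch.zip, discussed later in this thread), not the actual FLAIR #2 file list:

from torchgeo.datasets.utils import download_and_extract_archive

url_base = 'https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/'
# Placeholder archive names/md5s: one zip per split/modality plus the centroid mapping
archives = {
    'flair_aerial_train.zip': 'md5-placeholder',
    'flair_sen_train.zip': 'md5-placeholder',
    'flair_labels_train.zip': 'md5-placeholder',
    'flair_2_centroids_sp_to_patch.zip': 'md5-placeholder',
}

def _download(root: str) -> None:
    # Each split/modality ships as its own zip, so download and extract one by one
    for name, md5 in archives.items():
        download_and_extract_archive(url_base + name, root, md5=md5)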

_load_image, _load_sentinel, _load_target

For storage reasons, the elevation (5th band) of the image is stored as an unsigned integer: the original height is multiplied by 5 before encoding. We decided to divide the band by 5 on load to recover the original height, which makes the trained model more usable on other data. See Questions below.

As mentioned previously, additional metadata has to be used to get from the sentinel .npy files to the actual area. Initially, for debugging reasons, we implemented returning not the cropped image but the original data together with the cropping slices (i.e. indices). Consequently, the images can be plotted in a more meaningful manner; otherwise, the resolution is so low that one can hardly recognize features. This was crucial for debugging to find the correct logic (a classic y, x instead of x, y ordering mistake). We do not know whether this is sensible for production code. See Questions below. Moreover, the dimensions of the Sentinel data, $T \times C \times W \times H$ with $C = 10$, vary in both $T$ and $W$, $H$. This is problematic for the datamodule: we have not done extensive research, but the varying dimensions seem to break the module. Disabling the use_sentinel flag makes the module work. A sketch of the centroid-based cropping is shown below.
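To make the super-patch mapping concrete, here is a rough sketch of centroid-based cropping; the structure of the centroids JSON (patch id mapped to a (y, x) position inside the super-patch) and the window size are assumptions for illustration, not the exact FLAIR #2 format:

import json

import numpy as np

# Assumed layout: the JSON maps each aerial patch id to the (y, x) centroid of its
# footprint inside the Sentinel super-patch; `half` is an illustrative half-window size
with open('flair-2_centroids_sp_to_patch.json') as f:
    centroids = json.load(f)

def crop_superpatch(sp: np.ndarray, patch_id: str, half: int = 5) -> np.ndarray:
    # Crop the last two (spatial) dimensions around the centroid;
    # note the (y, x) ordering, not (x, y)
    y, x = centroids[patch_id]
    return sp[:, :, y - half : y + half, x - half : x + half]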

The labels include values from 1 to 19. The data paper explicitly suggests grouping classes $> 13$ into one class, other, due to underrepresentation, and we followed this suggestion. Furthermore, the labels were rescaled to the range 0 to 12. See Questions below.

Questions

  • Do you consider the elevation rescaling a distortion of the dataset? Shall I exclude it? The argument for it would be easier reusability on new datasets.

For storage optimization reasons, this elevation information is multiplied by a factor of 5 and encoded as an 8-bit unsigned integer datatype.

  • How shall we load/provide the Sentinel data: as cropped data or in some other way? I do not see the current implementation as fit for production.

    • Also, how do we want to plot it? The small red rectangle in the example plot above is the actual region. The low resolution is quite observable there.
  • Shall we rescale the classes to start from 0? Shall we group the classes as suggested in the data paper?

  • Check integrity in download_url does not seem to work (in unit tests). Why?

    • I have to make a separate check_integrity call; otherwise it passes even if the MD5s do not match (see the sketch after this list).
  • The GitHub Actions on the forked repo produce a magic ruff error (https://github.com/MathiasBaumgartinger/torchgeo/actions/runs/11687694109/job/32556175383#step:7:1265). Can you help me resolve this mystery?
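For reference, the kind of explicit check described above could look roughly like this (a minimal sketch using torchvision's download_url and check_integrity helpers; the wrapper name and arguments are illustrative, not the PR's actual code):

import os

from torchvision.datasets.utils import check_integrity, download_url


def download_and_verify(url: str, root: str, filename: str, md5: str) -> None:
    # Workaround sketch: download, then verify the checksum explicitly, since relying
    # on download_url's md5 argument alone did not fail the unit tests on a mismatch
    download_url(url, root, filename=filename, md5=md5)
    if not check_integrity(os.path.join(root, filename), md5):
        raise RuntimeError(f'{filename} is corrupted or its MD5 does not match')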

TODOs/FIXMEs

  • [x] Extend tests for toy datasets and apply md5 check
  • [x] Find correct band for plotting sentinel
  • [ ] Datamodule cannot handle sentinel data yet

MathiasBaumgartinger avatar Nov 05 '24 21:11 MathiasBaumgartinger

Thanks a lot for the PR; here are just some answers off the top of my head, others might have additional opinions.

Questions

* Do you consider the Elevation rescaling as distortion of the dataset? Shall I exclude it? The argument for it would be easier re-usability on new datasets.

Not sure about this one, but your proposition seems reasonable and as long as this is clearly documented, I would think it is fine.

* How shall we load/provide sentinel data? As cropped data or any other way. I do not see the current implementation as fit for production.

I think that depends on how the large Sentinel tile is expected to be used by, and be useful for, a model. Intuitively, the very low resolution area does not seem useful. I think having the flag you implemented is nice.

If you provide the entire tile, there will have to be a collate or cropping function in the datamodule, otherwise the dataloader cannot return consistent batches, as by default it tries to batch individual samples into a tensor by dict key.
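As an illustration (a rough sketch, not a concrete proposal for the datamodule; the 'sentinel' key name is an assumption), such a collate function could stack the fixed-size entries as usual and keep the variable-size Sentinel tensors as a list:

import torch
from torch.utils.data import default_collate


def collate_fn(batch: list[dict[str, torch.Tensor]]) -> dict[str, object]:
    # Pull out the variable-size Sentinel tensors so default_collate only sees
    # the fixed-size entries (e.g. image, mask), then re-attach them as a list
    sentinel = [sample.pop('sentinel') for sample in batch]
    collated: dict[str, object] = default_collate(batch)
    collated['sentinel'] = sentinel
    return collated

The datamodule could then pass this via DataLoader(..., collate_fn=collate_fn).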

  * Also, how do we want to plot it? The small red rectangle in the example plot above is the actual region. The low resolution is quite observable there.

I actually like that approach.

* Shall we rescale the Classes to start from 0? Shall we group the classes as suggested in the datapaper?

Classes should definitely be ordinally mapped to start from 0, otherwise there will be issues with the CrossEntropyLoss etc. Grouping is also okay since it follows the paper.

nilsleh avatar Nov 07 '24 15:11 nilsleh

Hi! Sorry for the late response.

First of all, thanks for the review. I will get to work on the proposed changes ;-)

MathiasBaumgartinger avatar Nov 21 '24 07:11 MathiasBaumgartinger

Apart from the Sentinel data, I think everything is on track now. Let me know if you have a preferred way for me to handle this.

MathiasBaumgartinger avatar Nov 24 '24 16:11 MathiasBaumgartinger

I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think the reason stems from inconsistent naming: the zip file is named flair_2_centroids_sp_to_patch.zip, with an underscore after flair, while the json file inside is named flair-2_centroids_sp_to_patch.json, with a hyphen after flair. I.e., the correct download URL is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip.

Great work though! Awesome to see FLAIR being implemented :smiley:

JacobJeppesen avatar Nov 25 '24 11:11 JacobJeppesen

Apologies if the comments were a bit scattered and not really a proper review. I was just trying to run the code and commented on the go about any issues I encountered :slightly_smiling_face:

JacobJeppesen avatar Nov 25 '24 13:11 JacobJeppesen

I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think reason stems from inconsistent naming, where the zip file is named flair_2_centroids_sp_to_patch.zip, with underscore after flair, and the json file inside is named flair-2_centroids_sp_to_patch.json, with hyphen after flair. I.e., the correct download url is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip.

Great work though! Awesome to see FLAIR being implemented 😃

Weird; I was sure I tried this. Thanks for letting me know.

MathiasBaumgartinger avatar Nov 25 '24 13:11 MathiasBaumgartinger

I was just testing the pull request, and there was an issue with the download, where it gets a 404 when trying to download https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair-2_centroids_sp_to_patch.zip. I think reason stems from inconsistent naming, where the zip file is named flair_2_centroids_sp_to_patch.zip, with underscore after flair, and the json file inside is named flair-2_centroids_sp_to_patch.json, with hyphen after flair. I.e., the correct download url is https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_centroids_sp_to_patch.zip. Great work though! Awesome to see FLAIR being implemented 😃

Weird; I was sure I tried this. Thanks for letting me know.

Yeah, it seems like a weird error. Perhaps they changed it on their side, so that the zip-file naming is now consistent. Although the json file inside still uses a hyphen instead of an underscore :slightly_smiling_face:

JacobJeppesen avatar Nov 25 '24 15:11 JacobJeppesen

It gets all the correct aerial imagery files now, and the download and extraction are working. However, I'm now hitting a much more annoying bug: a random IReadBlock failed at X offset ... error during training.

I've had this issue before with Rasterio, and it was due to concurrency issues. That shouldn't really happen here, as the dataloader just iterates through a list of files and each sample is a separate tif file. I can see there's also been a discussion here (https://github.com/microsoft/torchgeo/discussions/594), so perhaps @adamjstewart or @calebrob6 knows what's going on. My own guess is that multiple workers might have GDAL do simultaneous scans of the same directory when opening individual files in it. I.e., there are often many image files in each sub-folder of the dataset, and GDAL will scan the sub-folder for metadata when opening a file inside it, which could be the culprit. For now at least, I've added the code below in datasets/flair2.py and am testing whether it solves the issue.

This has been added below the imports in the top:

# Tell GDAL not to scan directories on open, to avoid IReadBlock errors in Rasterio when using
# num_workers > 0 in the dataloader. We should not have any concurrency issues for this dataset,
# as each worker reads individual tif files, but the errors might occasionally happen when
# multiple workers scan the same directory simultaneously.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

I've also minimized the time the data reader is open in Rasterio by closing it before doing the tensor operations. This shouldn't really solve the issue here, but I added it anyway as good practice. In the _load_image() function:

with rasterio.open(path) as f:
    array: np.typing.NDArray[np.int_] = f.read()
tensor = torch.from_numpy(array).float()
if "B05" in self.bands:
    # Height channel will always be the last dimension
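    # Elevation is stored multiplied by 5 as an unsigned integer; divide to recover the original values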
    tensor[-1] = torch.div(tensor[-1], 5)

and in the _load_target() function:

with rasterio.open(path) as f:
    array: np.typing.NDArray[np.int_] = f.read(1)
tensor = torch.from_numpy(array).long()
# According to datapaper, the dataset contains classes beyond 13
# however, those are grouped into a single "other" class
# Rescale the classes to be in the range [0, 12] by subtracting 1
torch.clamp(tensor - 1, 0, len(self.classes) - 1, out=tensor)

Not sure if it fixes it, but I'll train a handful of models and try it out. The issue coming from simultaneous directory scans is a bit of a guess.

JacobJeppesen avatar Nov 26 '24 15:11 JacobJeppesen

Never had this multiprocessing issue before, and I don't see anything in the code that looks sus. Can you reproduce this issue with the unit tests? If not I can also try downloading the dataset and reproducing locally.

adamjstewart avatar Nov 26 '24 17:11 adamjstewart

Unfortunately not. It pops up at random after training for quite a while. I encountered it the first time after having trained for almost 2 full epochs, i.e. it had already read the entire dataset once and then got the IReadBlock error during the second epoch. There have been some reports of similar errors here: https://github.com/rasterio/rasterio/issues/2053 . That's generally when reading the same file, though, and I agree, nothing looks suspicious in the code. I actually thought my disk was broken, but then I remembered that I had the exact same thought the last time I got this error with Rasterio.

I'll try and let it run some more times and see if it keeps happening. Just saw in https://github.com/rasterio/rasterio/issues/2053#issuecomment-1180666329 that upgrading to newer version of GDAL might help. I'll also give that a try.

JacobJeppesen avatar Nov 26 '24 20:11 JacobJeppesen

Finally got it debugged, and it was a system error. Download, extract, and training on the aerial data seems to all work as it should. Sorry about the noise!

Completely unrelated to this PR, but it was a corruption in one of the zpools in ZFS. So if you start using that at some point and you get weird errors that look like disk errors, but checking the disk says everything is fine, then try checking the status of your zpools. Spent too much time identifying this :expressionless:

JacobJeppesen avatar Nov 28 '24 13:11 JacobJeppesen

Can you resolve the merge conflicts so we can run CI on this PR?

adamjstewart avatar Nov 28 '24 13:11 adamjstewart

I think everything should be on track so far. I am getting a ruff error for unsorted tuples in the [datasets/datamodules]/__init__.py files. I explicitly tried sorting them with the VS Code sort-ascending command (8779f89), but apparently this did not work :man_shrugging:. Let me know if there is anything else I can do for you =)

MathiasBaumgartinger avatar Nov 28 '24 17:11 MathiasBaumgartinger

If you use ruff 0.8.0 it will sort __all__ correctly. You can also copy-n-paste the file from the latest version and add the new line.

adamjstewart avatar Nov 28 '24 19:11 adamjstewart

Thanks everyone for reviewing. Tried to apply all suggested changes.

As for the checks, according to the log we face the following error: /home/docs/checkouts/readthedocs.org/user_builds/torchgeo/checkouts/2394/docs/api/datasets.rst:194: ERROR: "csv-table" widths do not match the number of columns in table (10). I truly had messed something up here (i.e. something with line endings, separators and quotation marks). However, although I have since resolved this, the error in the check seems to persist.

MathiasBaumgartinger avatar Dec 09 '24 09:12 MathiasBaumgartinger

Does anyone have general opinions on whether we should call this:

FLAIR2()

or:

FLAIR(version=2)

I think it'd make sense to use FLAIR(version=2), as it seems like each new version is a superset of the previous version. Most users will probably use the latest version, so if they are individual datasets, FLAIR1() and FLAIR2() might end up as somewhat unused datasets once FLAIR3() is released. As I understand https://github.com/microsoft/torchgeo/issues/2303#issuecomment-2390977539, once version 3 is released, version 1 and 2 data can be directly loaded from version 3 by filtering the files. So the lowest complexity solution would probably be a FLAIR() dataset, where version 3 is the full dataset, version 2 is reduced coverage by filtering files/area, and version 1 is the same reduced coverage, but only aerial.

JacobJeppesen avatar Dec 13 '24 09:12 JacobJeppesen

Does anyone have general opinions on whether we should call this:

FLAIR2()

or:

FLAIR(version=2)

I think it'd make sense to use FLAIR(version=2), as it seems like each new version is a superset of the previous version. Most users will probably use the latest version, so if they are individual datasets, FLAIR1() and FLAIR2() might end up as somewhat unused datasets once FLAIR3() is released. As I understand #2303 (comment), once version 3 is released, version 1 and 2 data can be directly loaded from version 3 by filtering the files. So the lowest complexity solution would probably be a FLAIR() dataset, where version 3 is the full dataset, version 2 is reduced coverage by filtering files/area, and version 1 is the same reduced coverage, but only aerial.

I am in contact with @agarioud. If I understand him correctly, I doubt that new dataset releases will be required to be backward compatible.

Unfortunately there is no direct compatibility with FLAIR#1/#2 apart the aerial images as we reworked the supervision (land cover and now LPIS)

But I agree. Adding versioning to FLAIR() will probably result in less dead code. From a design perspective:

  1. We have a parent dataset FLAIR handling all the common logic. By passing a specific version (the default will always be the latest), a child class is initialized which overrides version-specific logic.
  2. We have a parent datamodule FLAIRModule, which passes a version down to the datasets; i.e. probably only a single FLAIRModule is necessary (no inheritance).
  3. Points 1 and 2 will also apply to FLAIRToy and FLAIRToyModule (probably both will inherit from the corresponding non-toy classes).

MathiasBaumgartinger avatar Dec 13 '24 09:12 MathiasBaumgartinger

The upcoming FLAIR-INC dataset will include all data from FLAIR#1 (aerial images) and FLAIR#2 (which added Sentinel-2 data; these were previously NPY files covering larger extents, but will now have the same spatial extent as the aerial patches and be provided in TIFF format). Additionally, it will introduce five new modalities and a second supervision dataset. If needed, a toy dataset can already be shared with @MathiasBaumgartinger.

Backward compatibility is not fully ensured. For example, aerial images will have one less channel, as DSM/DTM has been introduced as a separate modality. Additionally, the land-cover supervision dataset has reordered classes.

Therefore, I also believe a FLAIR() dataset would probably be more efficient?

agarioud avatar Dec 13 '24 10:12 agarioud

But I agree. Adding versioning to FLAIR() will probably result in less dead code. From a design perspective:

  1. We have a parent dataset FLAIR handling all the common logic. By passing a specific version (default will always be latest) a new child class is initialized which overrides version specific logic.
  2. We have a parent datamodule FLAIRModule. Which passes a version down to datasets. I.e. probably only a single FLAIRModule is necessary (no inheritance).
  3. Points 1 and 2 will be applied for FLAIRToy and FLAIRToyModule too (probably both will inherit from the corresponding non-toy classes.)

This sounds like a good approach :+1:

@agarioud sounds great. Looking forward to the release :slightly_smiling_face:

JacobJeppesen avatar Dec 13 '24 12:12 JacobJeppesen

It's not really a matter of compatibility. Basically, we either use:

class FLAIR(NonGeoDataset, abc.ABC):
    # shared base class

class FLAIR2(FLAIR):
    # override specific stuff

or:

class FLAIR(NonGeoDataset):
    def __init__(self, version=2, ...):
        if version == 2:
            # override specific stuff

It's more a question of whether the devs think of this as a new version of an existing dataset or a new dataset. We usually go with the former because it offers a bit better reproducibility (if the default version gets changed, then reproducibility is broken). See our EuroSAT dataset for an example of this. I think the only cases where we use version instead are in MoCoTask and SimCLRTask. I'm fine with both solutions, just want to make sure we're all on the same page.

adamjstewart avatar Dec 13 '24 15:12 adamjstewart

Is this still a WIP? Would love to include this in the 0.7.0 release next week.

adamjstewart avatar Mar 13 '25 17:03 adamjstewart

It still is a WIP. What is your deadline for 0.7.0?

MathiasBaumgartinger avatar Mar 17 '25 11:03 MathiasBaumgartinger

Not finalized yet but hopefully by the end of this week.

adamjstewart avatar Mar 17 '25 14:03 adamjstewart