
Added custom semantic segmentation trainer tutorial

Open calebrob6 opened this issue 1 year ago • 33 comments

This is a tutorial notebook that shows users how to subclass SemanticSegmentationTask to build a custom trainer for LandCoverAI.

calebrob6 avatar Feb 21 '24 23:02 calebrob6

@adamjstewart I think there is a bug (or at least some weird behavior going on) here

I can create a new class that extends SemanticSegmentationTask:

class CustomSemanticSegmentationTask(SemanticSegmentationTask):

    # any keywords we add here between *args and **kwargs will be found in self.hparams
    def __init__(self, *args, tmax=50, eta_min=1e-6, **kwargs) -> None:
        super().__init__(*args, **kwargs)  # pass args and kwargs to the parent class

Then instantiate and everything works great

task = CustomSemanticSegmentationTask(model="unet", tmax=100, ...)

However when I go to load from file:

task = CustomSemanticSegmentationTask.load_from_checkpoint("lightning_logs/version_3/checkpoints/epoch=0-step=117.ckpt")

I get an error:

TypeError: SemanticSegmentationTask.__init__() got an unexpected keyword argument 'ignore'

I can add del kwargs["ignore"] before the super() call in the constructor of CustomSemanticSegmentationTask, but this feels like a bad hack.

(@isaaccorley in case you've seen this)

calebrob6 avatar Feb 22 '24 21:02 calebrob6

Note that https://www.reviewnb.com/ is free for 'educational' use - would enable previewing the notebook

robmarkcole avatar Feb 23 '24 10:02 robmarkcole

@robmarkcole thanks for the review! Was that easy enough to do (vs reviewnb)?

On my side, I just have to run jupytext --sync custom_segmentation_trainer.ipynb whenever I change either file and commit both (which seems simple enough). We could add CI that makes sure all .py and .ipynb files stay in sync.

calebrob6 avatar Feb 23 '24 17:02 calebrob6

Yep easy enough

robmarkcole avatar Feb 23 '24 17:02 robmarkcole

@adamjstewart I think there is a bug (or at least some weird behavior going on) here [...]

I don't see an ignore param in your custom class. Did you modify the class code you used after training that checkpoint? If so, one thing you can do is to just load the checkpoint, delete that param in the hparam dict and then save the checkpoint which would fix the error.
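That one-off checkpoint surgery can be sketched like this; the helper name is made up here, and it relies on Lightning checkpoints keeping the saved hparams under the "hyper_parameters" key:

```python
def scrub_hparam(ckpt: dict, key: str) -> dict:
    """Drop a stale entry from a Lightning checkpoint's saved hparams."""
    # Lightning stores hyperparameters under the "hyper_parameters" key
    ckpt.get("hyper_parameters", {}).pop(key, None)
    return ckpt

# One-off usage against a real checkpoint file (path is a placeholder):
#   import torch
#   ckpt = torch.load("epoch=0-step=117.ckpt", map_location="cpu")
#   torch.save(scrub_hparam(ckpt, "ignore"), "epoch=0-step=117.ckpt")
```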

isaaccorley avatar Feb 23 '24 17:02 isaaccorley

The ignore param is stored in task.hparams because SemanticSegmentationTask passes it to BaseTask -- see https://github.com/microsoft/torchgeo/blob/main/torchgeo/trainers/segmentation.py#L98.

I want to be able to do something like this:

class CustomSemanticSegmentationTask(SemanticSegmentationTask):
    def __init__(self, *args, my_custom_arg, **kwargs) -> None:
        super().__init__(*args, **kwargs)

i.e. reuse the constructor from SemanticSegmentationTask without having to copy-paste its args and logic.

This works fine, but when I try to load a version of this class from a checkpoint, ignore is passed through to SemanticSegmentationTask because it is a kwarg saved in hparams.
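A pure-Python sketch of this failure mode, with stub classes standing in for Lightning and torchgeo (the real load_from_checkpoint replays every saved hparam as a keyword argument to the constructor):

```python
class BaseTask:
    def __init__(self, **kwargs):
        self.hparams = kwargs  # stands in for save_hyperparameters()

    @classmethod
    def load_from_checkpoint(cls, hparams):
        # Lightning replays every saved hparam as a constructor kwarg
        return cls(**hparams)


class SemanticSegmentationTask(BaseTask):
    def __init__(self, model="unet"):
        # the real class forwards ignore="weights" to BaseTask, so it lands
        # in hparams even though this __init__ cannot accept it back
        super().__init__(model=model, ignore="weights")


class CustomSemanticSegmentationTask(SemanticSegmentationTask):
    def __init__(self, *args, tmax=50, **kwargs):
        self.tmax = tmax
        super().__init__(*args, **kwargs)


task = CustomSemanticSegmentationTask(model="unet", tmax=100)
try:
    CustomSemanticSegmentationTask.load_from_checkpoint(task.hparams)
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'ignore'
```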

calebrob6 avatar Feb 23 '24 19:02 calebrob6

One workaround that I'm checking is just adding "ignore" to the list of args ignored in save_hyperparameters in BaseTask.

calebrob6 avatar Feb 23 '24 19:02 calebrob6

One workaround that I'm checking is just adding "ignore" to the list of args ignored in save_hyperparameters in BaseTask.

I think this makes the most sense since it's not an actual hparam.
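The shape of that fix, sketched with a stand-in for Lightning's save_hyperparameters (the real change would pass "ignore" in the ignore= list inside BaseTask.__init__):

```python
def save_hyperparameters(init_args: dict, ignore: tuple = ()) -> dict:
    # stand-in for Lightning's save_hyperparameters(): record init args,
    # skipping any names listed in `ignore`
    return {k: v for k, v in init_args.items() if k not in ignore}

# the fix: list "ignore" itself so it never lands in hparams and is never
# replayed by load_from_checkpoint
hparams = save_hyperparameters(
    {"model": "unet", "ignore": "weights"}, ignore=("ignore",)
)
```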

isaaccorley avatar Feb 23 '24 20:02 isaaccorley

The way that I've come up with to check whether an .ipynb is in sync with the corresponding .py is diff <(jupytext --output - --to py custom_segmentation_trainer.ipynb) custom_segmentation_trainer.py. This should only output something like:

3a4
> #     formats: ipynb,py
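That check could be wired into CI along these lines; the glob, temp path, and the 'formats:' filter are assumptions based on the command above:

```shell
# Fail if any notebook's py export drifts from its committed .py twin,
# ignoring the jupytext 'formats:' header that only the .py file carries.
status=0
for nb in docs/tutorials/*.ipynb; do
  [ -e "$nb" ] || continue            # glob matched nothing; skip
  py="${nb%.ipynb}.py"
  jupytext --output /tmp/export.py --to py "$nb"
  if diff /tmp/export.py "$py" | grep '^[<>]' | grep -qv 'formats:'; then
    echo "out of sync: $nb" >&2
    status=1
  fi
done
echo "sync check exit status: $status"
```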

calebrob6 avatar Feb 23 '24 21:02 calebrob6

I don't know why the notebook test is being cancelled (maybe because it is trying to run the LandCoverAI split script?).

calebrob6 avatar Feb 23 '24 22:02 calebrob6

I don't know why the notebook test is being cancelled (maybe because it is trying to run the LandCoverAI split script?).

Tests being canceled means the job either ran out of time, space, or memory. Here, my guess would be space. We want to use the smallest datasets possible, such as EuroSAT100. Might be worth creating a LandCoverAI100 or something like that.

adamjstewart avatar Feb 25 '24 19:02 adamjstewart

Haven't yet had time to evaluate jupytext to decide whether or not it's what we should use. @nilsleh what did you end up using for lightning-uq-box?

adamjstewart avatar Feb 25 '24 19:02 adamjstewart

I really don't like the idea of making custom datasets/datamodules just to have pretty CI -- it is a large overhead for something that makes the tutorial less cool.

calebrob6 avatar Feb 25 '24 20:02 calebrob6

And I really don't like having tutorials that take 30+ minutes to download a dataset and train a model for hundreds of epochs, or tutorials that can't be tested in CI because they involve more data than our runners can store. There's always a tradeoff. You can also find a smaller dataset instead of making your own.

adamjstewart avatar Feb 25 '24 20:02 adamjstewart

Luckily none of that happens here ;). LandCover.ai is 1.5GB (this will take 30+ minutes to download if your download speed is < 0.83 MB/s), training happens for 1 batch (and I can reduce the batch size to make this faster).

I'm saying that we shouldn't be catching bugs with overly sanitized examples -- if LandCoverAI breaks or Lightning training breaks, then our other tests will break. If the example notebooks are catching bugs, then we should ask ourselves why. Downloading LandCoverAI and running this now and once per release doesn't seem like a big burden.

calebrob6 avatar Feb 25 '24 22:02 calebrob6

How about this -- is it possible to change the LandCoverAI datamodule to use the test data that we already spent time creating for this notebook (with comments saying that if you actually want to play with the data, do this other thing)?

calebrob6 avatar Feb 25 '24 22:02 calebrob6

Haven't yet had time to evaluate jupytext to decide whether or not it's what we should use. @nilsleh what did you end up using for lightning-uq-box?

So I tried jupytext, but for my tutorials I couldn't get the jupytext scripts to execute and be displayed in the documentation. I went back to notebooks, which are now run when the documentation builds. However, I don't need to download any data, and model fitting is fast since these are just toy problems.

nilsleh avatar Feb 26 '24 08:02 nilsleh

How about this -- is it possible to change the LandCoverAI datamodule to use the test data that we already spent time creating for this notebook (with comments saying that if you actually want to play with the data, do this other thing)?

This is also undesirable because the notebook by default will train and display predictions on random noise. I would much rather have a tiny dataset with real data. I'm happy to make this myself (maybe next week).

adamjstewart avatar Feb 26 '24 12:02 adamjstewart

I don't think it needs to display predictions -- if we're only training for a batch for "making CI pretty" reasons, then it will display noise regardless.

calebrob6 avatar Feb 26 '24 12:02 calebrob6

We can train for multiple epochs in the tutorial, but use fast_dev_run in CI. The trainers tutorial does this with nbmake variable mocking. This also means we could use the synthetic dataset during CI and the real dataset during the tutorial. My only hope is that we don't train for longer than it takes to explain what we're doing in a live tutorial session. If we use this notebook in a tutorial and end up sitting there watching it awkwardly for an hour then it gets tedious.
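The pattern might look like this (the variable name and kwargs are assumptions, not torchgeo's exact tutorial code; fast_dev_run is Lightning's flag for running a single batch of train/val/test):

```python
# A notebook cell variable that nbmake's variable mocking can override in CI.
fast_dev_run = False

def trainer_kwargs(fast_dev_run: bool) -> dict:
    # with fast_dev_run=True, Lightning runs one batch of train/val/test,
    # so the notebook finishes in seconds during CI
    if fast_dev_run:
        return {"fast_dev_run": True}
    # full tutorial run: a few real epochs on the real dataset
    return {"max_epochs": 10, "log_every_n_steps": 1}

# in the notebook this would feed lightning.pytorch.Trainer:
#   trainer = Trainer(**trainer_kwargs(fast_dev_run))
```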

adamjstewart avatar Feb 26 '24 16:02 adamjstewart

@adamjstewart -- checking in on this: do we want to force a LandCoverAI100 DataModule for this?

calebrob6 avatar Aug 29 '24 16:08 calebrob6

If the tests are failing due to OOM then we'll have to do something like that. I think it will demonstrate how quick and easy it is to get a working model using a pre-trained backbone and a tiny dataset. We could even add a pre-trained U-Net model.

adamjstewart avatar Aug 29 '24 17:08 adamjstewart

Yeah I'd love to follow this up with a "how to use pre-trained models" -- just need to figure this one out first

calebrob6 avatar Aug 29 '24 18:08 calebrob6

Adding LandCoverAI100 in https://github.com/microsoft/torchgeo/pull/2262

calebrob6 avatar Aug 29 '24 19:08 calebrob6

As a nice addition, you could show how to export the trained model, à la https://qgis-plugin-deepness.readthedocs.io/en/latest/example/example_segmentation_landcover.html

robmarkcole avatar Sep 13 '24 15:09 robmarkcole

Ready for review now that we've merged LandCoverAI100.

Easy link for looking at the notebook -- https://github.com/microsoft/torchgeo/blob/tutorial/week1/docs/tutorials/custom_segmentation_trainer.ipynb

calebrob6 avatar Sep 18 '24 19:09 calebrob6

Tests are failing on importing LandCoverAI100DataModule -- is this because we install torchgeo in the first cell?

calebrob6 avatar Sep 19 '24 16:09 calebrob6

Yes, see #2300 for the fix.

adamjstewart avatar Sep 20 '24 09:09 adamjstewart

Fix is implemented in #2306

sfalkena avatar Sep 20 '24 11:09 sfalkena

Nice, thanks @sfalkena! What are the odds of finding this bug at nearly the same time 🙂

calebrob6 avatar Sep 20 '24 15:09 calebrob6