torchgeo icon indicating copy to clipboard operation
torchgeo copied to clipboard

DataModules: pass kwargs directly to datasets

Open adamjstewart opened this issue 2 years ago • 1 comments

Closes #666

Pros

  • Consistency: every datamodule now has the same kwargs as its corresponding dataset
  • New features: this means that every datamodule now supports automatic downloads if its corresponding dataset does too
  • No unused kwargs: previously, a typo in an argument or non-existent argument would go unnoticed, now it will raise an error
  • Reduced code duplication: -140 lines of code

Cons

  • Backwards incompatibility, root_dir has been renamed to root for consistency
  • Can no longer pass root_dir as the first positional arg, must be a keyword arg

To elaborate, the following are no longer valid:

BigEarthNetDataModule('data/bigearthnet')
BigEarthNetDataModule(root_dir='data/bigearthnet')

Instead, users will need to use:

BigEarthNetDataModule(root='data/bigearthnet')
BigEarthNetDataModule(64, 0, 'data/bigearthnet')

adamjstewart avatar Aug 19 '22 01:08 adamjstewart

Will give @calebrob6 a chance to review, no rush on this PR since it's going in 0.4.0.

adamjstewart avatar Aug 31 '22 18:08 adamjstewart

Agree that this was much needed -- I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter. Since every dataset will need a root parameter can't we just keep it in?

calebrob6 avatar Sep 24 '22 23:09 calebrob6

Since every dataset will need a root parameter

In theory, we could someday have datasets without a root parameter (streaming data from Azure/AWS for example). But yes, all of our current datasets have one.

I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter... can't we just keep it in?

We could keep it in and forward it like we used to. Or we could properly document the available kwargs, or the fact that kwargs will be forwarded. Or both.

adamjstewart avatar Sep 25 '22 00:09 adamjstewart

Properly document the available kwargs seems best to me

calebrob6 avatar Sep 25 '22 19:09 calebrob6

@calebrob6 check out the docs for the latest version and see if that's better. The alternative would be to make a formal doc section like:

Initialize a LightningDataModule for BigEarthNet based DataLoaders.           
                                                                                         
Args:                                                                            
    batch_size: The batch size to use in all created DataLoaders                 
    num_workers: The number of workers to use in all created DataLoaders         

Keyword Args:
    root: root directory where dataset can be found                              
    split: train/val/test split to load                                          
    bands: load Sentinel-1 bands, Sentinel-2, or both. one of {s1, s2, all}      
    num_classes: number of classes to load in target. one of {19, 43}            
    transforms: a function/transform that takes input sample and its target as   
        entry and returns a transformed version                                  
    download: if True, download dataset and store it in the root directory       
    checksum: if True, check the MD5 of the downloaded files (may be slow)

but that ends up with a lot of duplication and becomes harder to maintain.

adamjstewart avatar Sep 27 '22 16:09 adamjstewart

Closing/reopening to try to kick off CLA bot.

adamjstewart avatar Sep 27 '22 20:09 adamjstewart

@microsoft-github-policy-service agree

adamjstewart avatar Sep 27 '22 20:09 adamjstewart

This looks like a good compromise on documentation. What if we specially check for "root" in kwargs and raise a helpful error message if it isn't included? My worry is that if a user doesn't add root they'll get an error message from the Dataset object which might be confusing.

calebrob6 avatar Sep 30 '22 17:09 calebrob6

(or warning)

calebrob6 avatar Sep 30 '22 17:09 calebrob6

The stack trace will tell them the full path the code took to raise that error (not necessarily clear to a non-programmer though). root isn't required, all datasets have a default root. If anyone ever implements #660 the error message will be a bit more helpful for datasets that don't already explain what root is set to.

adamjstewart avatar Sep 30 '22 17:09 adamjstewart

It only needs to be a message on the datasets that do have root, right?

On Fri, Sep 30, 2022, 10:53 AM Adam J. Stewart @.***> wrote:

The stack trace will tell them the full path the code took to raise that error (not necessarily clear to a non-programmer though). root isn't required, all datasets have a default root. If anyone ever implements #660 https://github.com/microsoft/torchgeo/issues/660 the error message will be a bit more helpful for datasets that don't already explain what root is set to.

— Reply to this email directly, view it on GitHub https://github.com/microsoft/torchgeo/pull/730#issuecomment-1263855368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIJUTUQT3UF7ZH6YGJJ6KTWA4SK5ANCNFSM567BHLFA . You are receiving this because you were mentioned.Message ID: @.***>

calebrob6 avatar Sep 30 '22 18:09 calebrob6

At the moment this is every dataset.

adamjstewart avatar Sep 30 '22 18:09 adamjstewart