torchgeo
torchgeo copied to clipboard
DataModules: pass kwargs directly to datasets
Closes #666
Pros
- Consistency: every datamodule now has the same kwargs as its corresponding dataset
- New features: this means that every datamodule now supports automatic downloads if its corresponding dataset does too
- No unused kwargs: previously, a typo in an argument or non-existent argument would go unnoticed, now it will raise an error
- Reduced code duplication: -140 lines of code
Cons
- Backwards incompatibility,
root_dir
has been renamed toroot
for consistency - Can no longer pass
root_dir
as the first positional arg, must be a keyword arg
To elaborate, the following are no longer valid:
BigEarthNetDataModule('data/bigearthnet')
BigEarthNetDataModule(root_dir='data/bigearthnet')
Instead, users will need to use:
BigEarthNetDataModule(root='data/bigearthnet')
BigEarthNetDataModule(64, 0, 'data/bigearthnet')
Will give @calebrob6 a chance to review, no rush on this PR since it's going in 0.4.0.
Agree that this was much needed -- I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter. Since every dataset will need a root parameter can't we just keep it in?
Since every dataset will need a root parameter
In theory, we could someday have datasets without a root parameter (streaming data from Azure/AWS for example). But yes, all of our current datasets have one.
I just don't like how it isn't obvious from the DataModule constructors that you need to pass a root parameter... can't we just keep it in?
We could keep it in and forward it like we used to. Or we could properly document the available kwargs, or the fact that kwargs will be forwarded. Or both.
Properly document the available kwargs seems best to me
@calebrob6 check out the docs for the latest version and see if that's better. The alternative would be to make a formal doc section like:
Initialize a LightningDataModule for BigEarthNet based DataLoaders.
Args:
batch_size: The batch size to use in all created DataLoaders
num_workers: The number of workers to use in all created DataLoaders
Keyword Args:
root: root directory where dataset can be found
split: train/val/test split to load
bands: load Sentinel-1 bands, Sentinel-2, or both. one of {s1, s2, all}
num_classes: number of classes to load in target. one of {19, 43}
transforms: a function/transform that takes input sample and its target as
entry and returns a transformed version
download: if True, download dataset and store it in the root directory
checksum: if True, check the MD5 of the downloaded files (may be slow)
but that ends up with a lot of duplication and becomes harder to maintain.
Closing/reopening to try to kick off CLA bot.
@microsoft-github-policy-service agree
This looks like a good compromise on documentation. What if we specially check for "root" in kwargs and raise a helpful error message if it isn't included? My worry is that if a user doesn't add root
they'll get an error message from the Dataset object which might be confusing.
(or warning)
The stack trace will tell them the full path the code took to raise that error (not necessarily clear to a non-programmer though). root
isn't required, all datasets have a default root
. If anyone ever implements #660 the error message will be a bit more helpful for datasets that don't already explain what root
is set to.
It only needs to be a message on the datasets that do have root, right?
On Fri, Sep 30, 2022, 10:53 AM Adam J. Stewart @.***> wrote:
The stack trace will tell them the full path the code took to raise that error (not necessarily clear to a non-programmer though). root isn't required, all datasets have a default root. If anyone ever implements #660 https://github.com/microsoft/torchgeo/issues/660 the error message will be a bit more helpful for datasets that don't already explain what root is set to.
— Reply to this email directly, view it on GitHub https://github.com/microsoft/torchgeo/pull/730#issuecomment-1263855368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIJUTUQT3UF7ZH6YGJJ6KTWA4SK5ANCNFSM567BHLFA . You are receiving this because you were mentioned.Message ID: @.***>
At the moment this is every dataset.