
Cityscapes AutoLabelling dataset

Open lkdci opened this issue 1 year ago • 1 comment

The Cityscapes AutoLabelled dataset was introduced by the NVIDIA research group. Paper: "Hierarchical Multi-Scale Attention for Semantic Segmentation", https://arxiv.org/abs/2005.10821. Official repo: https://github.com/NVIDIA/semantic-segmentation

This PR includes:

  • CityscapesConcatDataset to support combining Cityscapes subsets (see the sketch after this list).
  • An example ddrnet recipe with the AL dataset.
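
For context, here is a minimal sketch of what a concat dataset along these lines could look like, built on torch.utils.data.ConcatDataset. The class body, the _DummySubset stand-in, and the constructor signature are assumptions for illustration, not the PR's actual code:

from torch.utils.data import ConcatDataset, Dataset

class _DummySubset(Dataset):
    # Stand-in for a Cityscapes subset (a real one would return (image, label) pairs)
    def __init__(self, n: int):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx

class CityscapesConcatDataset(ConcatDataset):
    # Concatenates several Cityscapes subsets (e.g. the fine 'train' split plus the
    # auto-labelled coarse split) so they can be sampled as a single dataset.
    pass

combined = CityscapesConcatDataset([_DummySubset(3), _DummySubset(5)])
assert len(combined) == 8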

lkdci avatar May 14 '23 14:05 lkdci

Hi @Louis-Dupont, there is a design conflict regarding dataloader creation.

A dataloader / dataset creation strategy can be implemented in two different ways.

First Approach - dataloader factory:

train_dataloader: cityscapes_train

This approach is problematic since it hinges on loading default parameters from a default yaml file defined in code. When dataset_params are then passed through the recipe config, default values we might not want are forced on us: they are injected within the code, which contradicts the yaml approach for building configs.

see this example:

dataloader factory method:

def my_dataset_train(...):
    # Defaults are loaded from "my_dataset_default.yaml", hidden inside the factory
    return get_data_loader(config_name="my_dataset_default", dataset_cls=MyDataset)

my_dataset_default.yaml:

...
train_dataloader_params:
  sampler: "my_data_sampler"

Then, in main_recipe_config.yaml:

train_dataloader: my_dataset_train

Following this example, we are not able to instantiate the dataset without the sampler field, and we might easily miss that it was injected into the dataloader params in the first place. (This issue was reported before for the COCO dataset with the infinite sampler in previous versions.)
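
To make the hidden injection concrete, here is a minimal sketch of the kind of merge a dataloader factory performs internally. The use of OmegaConf and the exact merge call are assumptions for illustration, not SG's actual code:

from omegaconf import OmegaConf

# Defaults baked into the code-side yaml (my_dataset_default.yaml)
defaults = OmegaConf.create({"train_dataloader_params": {"sampler": "my_data_sampler"}})
# What the user actually wrote in the recipe
recipe = OmegaConf.create({"train_dataloader_params": {"batch_size": 8}})

merged = OmegaConf.merge(defaults, recipe)
print(OmegaConf.to_container(merged))
# {'train_dataloader_params': {'sampler': 'my_data_sampler', 'batch_size': 8}}
# 'sampler' is injected even though the recipe never mentioned it.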

Second Approach - dataset factory:

Explicitly define the dataset type to use, without a wrapper dataloader factory:

Following the previous example, we add the dataset key to my_dataset_default.yaml:

my_dataset_default.yaml:

...
train_dataloader_params:
  dataset: MyDataset
  sampler: "my_data_sampler"

In contrast to the previous approach, we are not bound by the above default params, and we can set a different dataset params file:

my_dataset_custom_params.yaml:

...
train_dataloader_params:
  dataset: MyDataset

Then, in main_recipe_config.yaml, explicitly choose the required dataset_params to use:

defaults:
  - dataset_params: my_dataset_custom_params
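
For illustration, here is a hedged sketch of how the dataset-factory approach could resolve the dataset key into a class before building the DataLoader. The registry and helper names are assumptions, not SG's actual implementation:

from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    # Placeholder dataset (illustration only)
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

DATASETS = {"MyDataset": MyDataset}  # hypothetical name -> class registry

def build_dataloader(train_dataloader_params: dict, dataset_params: dict) -> DataLoader:
    params = dict(train_dataloader_params)
    # Pop 'dataset' so it is resolved here and not forwarded to DataLoader(...)
    dataset_cls = DATASETS[params.pop("dataset")]
    dataset = dataset_cls(**dataset_params)
    return DataLoader(dataset=dataset, **params)

dl = build_dataloader({"dataset": "MyDataset", "batch_size": 2}, {})

Popping the key before constructing the DataLoader is exactly what avoids the duplicate-keyword clash shown in the error below.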

IMO this approach is preferable: it has better visibility and doesn't involve hidden behavior within the dataloader factory code.

Why not support both approaches?

Both approaches are supported within SG, but there is a bug when using both for a given dataset, and the following error is raised:

Error
Traceback (most recent call last):
  File "/home/lior.kadoch/PycharmProjects/super-gradients/tests/unit_tests/dataloader_factory_test.py", line 286, in test_cityscapes_al_train_creation
    dl_train = cityscapes_auto_labelling_train()
  File "/home/lior.kadoch/PycharmProjects/super-gradients/src/super_gradients/training/dataloaders/dataloaders.py", line 548, in cityscapes_auto_labelling_train
    return get_data_loader(
  File "/home/lior.kadoch/PycharmProjects/super-gradients/src/super_gradients/training/dataloaders/dataloaders.py", line 80, in get_data_loader
    dataloader = DataLoader(dataset=dataset, **dataloader_params)
TypeError: type object got multiple values for keyword argument 'dataset'
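
For reference, a minimal reproduction of that clash, assuming the dataset key from the defaults yaml survives into dataloader_params when the wrapper factory is also used:

from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

dataset = MyDataset()
# If the 'dataset' key from the yaml survives into dataloader_params...
dataloader_params = {"dataset": MyDataset, "batch_size": 2}
# ...the explicit keyword and the unpacked dict collide:
DataLoader(dataset=dataset, **dataloader_params)
# TypeError: ... got multiple values for keyword argument 'dataset'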

lkdci avatar May 15 '23 10:05 lkdci