Balanced multiple datasets with variable sizes (for train)?

Open askarbozcan opened this issue 4 years ago • 3 comments

Assume I have several datasets (CocoVID format, wrapped with CocoVideoDataset), but the number of samples differs a lot between them. Is there a way (even if I have to implement a custom module) to make sure that, during training, samples are drawn fairly from all datasets?

So, even if Dataset A has 10000 frames and Dataset B has only 100, the train samples would be 50% from Dataset A and 50% from Dataset B.

Thank you.

askarbozcan avatar Sep 04 '21 11:09 askarbozcan

Oh, figured it out. Using a combination of RepeatDataset and ConcatDataset, we can manually balance out the final train set using ratios.

Example: Dataset A has 100 samples, Dataset B has 10 and Dataset C has 1.

Setting the "times" parameter of the RepeatDataset wrapper to 1 for Dataset A, 10 for Dataset B and 100 for Dataset C makes each dataset contribute 100 samples, so the concatenated train set ends up balanced.
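For anyone who wants to derive the ratios instead of picking them by hand, here is a minimal standalone sketch. It is not a config file itself; `balanced_times` and `repeat_cfg` are hypothetical helpers, the `ann_file` paths are placeholders, and required fields such as `pipeline` and `classes` are left out. It assumes the `type='ConcatDataset'` / `datasets=[...]` config form is handled by the dataset builder in your installed version.

```python
def balanced_times(sizes):
    """Hypothetical helper: integer repeat factor per dataset so that
    each repeated dataset roughly matches the size of the largest one."""
    largest = max(sizes.values())
    return {name: max(1, round(largest / n)) for name, n in sizes.items()}


def repeat_cfg(name, times):
    """Hypothetical helper: wrap one dataset config in RepeatDataset.
    The ann_file path is a placeholder; pipeline/classes are omitted."""
    return dict(
        type='RepeatDataset',
        times=times,
        dataset=dict(type='CocoVideoDataset', ann_file=f'path/to/{name}.json'),
    )


# Sample counts from the example in this comment.
sizes = {'A': 100, 'B': 10, 'C': 1}
times = balanced_times(sizes)

# Concatenate the repeated datasets into one train set config.
train = dict(
    type='ConcatDataset',
    datasets=[repeat_cfg(name, t) for name, t in times.items()],
)

print(times)
```

The same helper gives `times` of 1 and 100 for the 10000-frame and 100-frame datasets from the original question. Since `times` has to be an integer, sizes that do not divide evenly will only be balanced approximately.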

Keeping it here, in case anyone else googles for it.

askarbozcan avatar Sep 04 '21 12:09 askarbozcan

Yes, we provide some dataset wrappers to mix datasets or modify the dataset distribution for training.

GT9505 avatar Sep 06 '21 01:09 GT9505

Hello, I have a question about how to use RepeatDataset. @askarbozcan, if my dataset_type is 'CocoVideoDataset', do I need to rewrite the config like this:

train = dict(
    type='RepeatDataset',
    times=10,
    dataset=dict(  # This is the original config of Dataset_A
        type='CocoVideoDataset',
        classes=classes,
        ann_file='path/to/your/train/data',
        pipeline=train_pipeline
    )
)

Is that right? Could you please give me some advice? Thank you so much!
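To illustrate what `times=10` would do in a config like this, here is a small stand-in written in plain Python. `RepeatSketch` is a hypothetical class for illustration only, not the library's RepeatDataset; it assumes the wrapper simply multiplies the reported length and wraps indices back onto the original samples.

```python
class RepeatSketch:
    """Illustrative stand-in for a repeat wrapper (not the library class)."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __getitem__(self, idx):
        # Indices past the original length wrap around to the start.
        return self.dataset[idx % self._ori_len]

    def __len__(self):
        # The wrapper reports `times` copies of the original dataset.
        return self.times * self._ori_len


frames = ['f0', 'f1', 'f2']  # placeholder samples
repeated = RepeatSketch(frames, times=10)

print(len(repeated))
print(repeated[4])
```

Under that assumption, no data is copied; the same samples are just visited more often per epoch, which is why the repeat factor changes the mix once several wrapped datasets are concatenated.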

lijoe123 avatar Apr 05 '23 08:04 lijoe123