Issue with class-incremental SI and LwF

Open Nyderx opened this issue 3 years ago • 15 comments

Hi!

We are currently trying to use Avalanche in our research, as it looks like an amazing library providing a lot of ready-to-use tools. However, we encountered some issues that stopped us from moving further.

Our goal is to work with class-incremental scenarios. We experimented on MNIST, building the benchmark in two different ways: using the SplitMNIST benchmark and creating a benchmark with the nc_benchmark method.

    from torchvision.datasets import MNIST
    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.benchmarks.generators import nc_benchmark

    # 'nc_MNIST' builds the benchmark with nc_benchmark, 'splitMNIST' uses SplitMNIST
    scenario_toggle = 'nc_MNIST'
    task_labels = False
    if scenario_toggle == 'splitMNIST':
        scenario = SplitMNIST(n_experiences=5, return_task_id=task_labels,
                              fixed_class_order=list(range(10)))
    elif scenario_toggle == 'nc_MNIST':
        # train_transform / test_transform are defined elsewhere in the project
        train = MNIST(root='data', download=True, train=True, transform=train_transform)
        test = MNIST(root='data', download=True, train=False, transform=test_transform)
        scenario = nc_benchmark(
            train, test, n_experiences=5, shuffle=False, seed=1234,
            task_labels=task_labels, fixed_class_order=list(range(10))
        )

We tried using two strategies: LwF and SI. We tried both values of scenario_toggle: splitMNIST and nc_MNIST. However, the evaluation results in both cases suggest that only the last experience is remembered and recognized. All other experiences have an accuracy equal to 0.00, which is unexpected and suggests that something is wrong.
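For context, the strategy setup looks roughly like this (a minimal sketch rather than our exact code: the hyperparameter values are placeholders and the import paths may differ between Avalanche versions):

    import torch
    from torch.nn import CrossEntropyLoss
    from avalanche.models import SimpleMLP
    from avalanche.training.supervised import LwF  # SynapticIntelligence is set up analogously

    model = SimpleMLP(num_classes=10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    # alpha (distillation strength) and temperature are placeholder values
    strategy = LwF(model, optimizer, CrossEntropyLoss(),
                   alpha=1.0, temperature=2.0,
                   train_mb_size=128, train_epochs=1, eval_mb_size=128)

    # Train on each experience, then evaluate on the whole test stream
    for experience in scenario.train_stream:
        strategy.train(experience)
        strategy.eval(scenario.test_stream)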

Sample results:

eval_exp,training_exp,eval_accuracy,eval_loss,forgetting
0,0,1.0000,0.0000,0
1,0,0.0000,13.4988,0
2,0,0.0000,12.4454,0
3,0,0.0000,16.2600,0
4,0,0.0000,16.5519,0
0,1,0.0000,17.6948,1.0000
1,1,0.9998,0.0011,0
2,1,0.0000,13.9904,0
3,1,0.0000,14.3507,0
4,1,0.0000,15.7323,0
0,2,0.0000,14.9688,1.0000
1,2,0.0000,23.8310,0.9998
2,2,1.0000,0.0000,0
3,2,0.0000,13.4626,0
4,2,0.0000,14.5919,0
0,3,0.0000,21.7956,1.0000
1,3,0.0000,26.1762,0.9998
2,3,0.0000,33.1996,1.0000
3,3,1.0000,0.0001,0
4,3,0.0000,21.8687,0
0,4,0.0000,18.1376,1.0000
1,4,0.0000,14.5067,0.9998
2,4,0.0000,20.7260,1.0000
3,4,0.0000,24.5977,1.0000
4,4,0.9990,0.0035,0

Similar behavior can be observed for all combinations (splitMNIST or nc_MNIST combined with LwF and EWC) when task_labels = False. When we change task_labels to True, the results start to make sense, with accuracies between 0.6 and 1 for all previously learned experiences.

We are not sure whether the problem is in our approach, our code, or maybe if there is some bug impacting our results. Therefore, we have a few questions:

  1. Is our approach valid? Is setting task_labels to False equivalent to creating a class-incremental benchmark, and does task_labels = True produce a task-incremental scenario?
  2. Is there any reason why the results look like this? Is it an issue with how we use the benchmarks?

We would appreciate any suggestions, as we have already spent some time with Avalanche and would love to leverage all the tools it provides.

I am providing the minimal test project we prepared. avalanche-test-project.zip

Nyderx avatar Jun 20 '22 17:06 Nyderx

We have reproducibility scripts for SI and EWC in task-incremental settings. It is known from many results in the literature that they do not work well without task labels.

AntonioCarta avatar Jun 27 '22 12:06 AntonioCarta

Following up on @AntonioCarta's comment: here is a paper showing that SI and LwF almost completely fail in class-incremental scenarios for Split-MNIST: https://arxiv.org/pdf/1904.07734.pdf

[Screenshot: results table from the linked paper]

By changing your architecture to the one used in the CL baselines repository, you may get a small increase in the average accuracy (better than complete forgetting) for LwF.
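For reference, a two-hidden-layer MLP along these lines (a hypothetical sketch, not necessarily the exact architecture from the baselines repository):

    import torch.nn as nn

    class MLP(nn.Module):
        """Small fully-connected network for 28x28 MNIST inputs."""
        def __init__(self, hidden_size=400, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Flatten(),
                nn.Linear(28 * 28, hidden_size), nn.ReLU(),
                nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            )
            self.classifier = nn.Linear(hidden_size, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x))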

HamedHemati avatar Jun 27 '22 12:06 HamedHemati

I'm working with @Nyderx. In my case I'm using the same dataset (MNIST), and my results are the same as above when I use the nc_benchmark method. But in the case of dataset_benchmark the results are different (please see the attachment). In my opinion the results should be exactly the same in both cases, nc_benchmark or dataset_benchmark, but they are not (regardless of whether we add task labels or not). Interestingly, in the case of dataset_benchmark without labels the results for SI and EWC are much better than with nc_benchmark without labels (there are no zero accuracies, i.e. catastrophic forgetting does not occur). So my question is: which method returns the proper results?

In the attachment there are results for both creation methods in both scenarios: with and without task_labels. Moreover, I've added that case to the program so it can be reproduced.

avalanche-test-project_with_ds.zip simple_mnist_tests.zip

dominik1102 avatar Jun 29 '22 12:06 dominik1102

Can you summarize your results here and why you think they are wrong?

AntonioCarta avatar Jun 30 '22 16:06 AntonioCarta

@AntonioCarta Thanks for your response.

First of all, let's focus on SI and EWC. As @Nyderx showed, and as you can see in my results, when we create benchmarks with the nc_benchmark method without task_labels, we observe catastrophic forgetting for SI and EWC. But in the case of dataset_benchmark without task_labels we don't observe such catastrophic forgetting. In my opinion the results should be exactly the same (or very similar). We would like to understand the reason for those differences and which method should be used in which case (the only difference is how we create the benchmark; the rest of the code is exactly the same).

Moreover, in the remaining cases, e.g. GEM without task id, the results are also different (in some steps we can observe a 20% discrepancy, which is significant).

Please take a look at the results I attached in my previous post. This is very important because in our work we want to improve a method to avoid catastrophic forgetting, and if the way we create a benchmark returns totally different results, we are afraid that one of them is wrong.

dominik1102 avatar Jul 01 '22 09:07 dominik1102

Hi @dominik1102, I agree that the results should be the same whether you build the benchmark with nc_benchmark or dataset_benchmark. Can you post here a small script reproducing the behavior? In particular, the code creating the dataset_benchmark would be useful. If you can, copy/paste the code here with Python formatting.

AndreaCossu avatar Jul 01 '22 09:07 AndreaCossu

@AndreaCossu in my previous post (four above) there is a script and the results; you can simply reproduce this. If anything is unclear, let me know.

dominik1102 avatar Jul 01 '22 09:07 dominik1102

Code looks fine. I am wondering if the difference is only in the metric names when you use task labels. Can you print here the output of the TextLogger when you are using the dataset_benchmark generator? Take into consideration that metrics will be split by task in that case.

AndreaCossu avatar Jul 01 '22 09:07 AndreaCossu

> Code looks fine. I am wondering if the difference is only in the metric names when you use task labels. Can you print here the output of the TextLogger when you are using the dataset_benchmark generator? Take into consideration that metrics will be split by task in that case.

Sure, I'll do this ASAP.

dominik1102 avatar Jul 01 '22 09:07 dominik1102

@AndreaCossu

Here are the text logs: logs_without_task_labels.zip

Moreover, I took a look at the TensorboardLogger output and the training process looks fine.

dominik1102 avatar Jul 04 '22 09:07 dominik1102

After a quick look I think the problem is related to the targets of the dataset of each experience. You can see it by printing them with this code:

# Print the set of class labels (targets) contained in each experience
for train_batch_info in scenario.train_stream:
    print(list(set([y for _, y, *_ in train_batch_info.dataset])))

for test_batch_info in scenario.test_stream:
    print(list(set([y for _, y, *_ in test_batch_info.dataset])))

If you use task labels but no multihead, ensure that targets are in [0,9]. With SplitMNIST you get targets in [0,1] with task labels activated, but with dataset_benchmark you should be able to use the full set of targets (take a look at the API doc for the arguments you can set). Can you please give it a try with the correct targets and see if the performance still remains high when activating task labels?

The fact that targets are silently converted in SplitMNIST is rather confusing; we should either make this more visible in the API doc or allow the user to override this behavior @AntonioCarta

AndreaCossu avatar Jul 04 '22 10:07 AndreaCossu

Thank you for all your help. We hope to have a long and successful adventure with Avalanche :)

Dominik will provide the output you asked for tomorrow.

I am a little confused right now - probably partly because, although we have some experience with the topic, we are still missing a few things.

Considering that we want to work in class-incremental scenarios:

  1. Does it mean that the only choice for us is single-headed models?
  2. Therefore, we should have targets in the range 0-9, not 0-1?
  3. How can we achieve that with splitMNIST and nc_benchmark?

Nyderx avatar Jul 04 '22 16:07 Nyderx

  1. In class-incremental scenarios you do not have the task label available, hence you either work with a single-headed model or you work with a multi-head model but have to actively infer the task labels, since the environment will not provide them.
  2. The target range depends on the head: if you use a single head you need targets in the range [0, n_classes-1] (0-9 for Split MNIST). If you use a multi-head you have one linear classifier for each head, hence you need targets in the range [0, n_units_per_head-1] (0-1 for Split MNIST with 5 heads and 2 units per head).
  3. To work class-incrementally you can just set task_labels=False in both SplitMNIST and nc_benchmark. Task labels will always be 0 for each experience and targets will be in the range 0-9 (see the sketch below).
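A minimal sketch of the two setups (assuming the SimpleMLP and MTSimpleMLP models shipped with Avalanche; exact names and import paths may differ between versions):

    from avalanche.benchmarks.classic import SplitMNIST
    from avalanche.models import SimpleMLP, MTSimpleMLP

    # Class-incremental: no task labels, a single head, targets in 0-9
    ci_scenario = SplitMNIST(n_experiences=5, return_task_id=False)
    ci_model = SimpleMLP(num_classes=10)

    # Task-incremental: task labels available, one head per task, targets in 0-1
    ti_scenario = SplitMNIST(n_experiences=5, return_task_id=True)
    ti_model = MTSimpleMLP()  # multi-head model that uses the task label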

Hope this helps :smile:

AndreaCossu avatar Jul 04 '22 16:07 AndreaCossu

@AndreaCossu

  1. SplitMNIST with return_task_id=True --> [0, 1]
  2. SplitMNIST with return_task_id=False --> [0, 9]
  3. nc_benchmark with task_labels=True -> [0, 9]
  4. nc_benchmark with task_labels=False -> [0, 9]
  5. dataset_benchmark with task_labels=True --> [0, 1] <- task labels are provided through the usage of AvalancheDataset with the task_labels flag
  6. dataset_benchmark with task_labels=False --> [0, 1]

So questions :

  1. Why in nc_benchmark are the classes [0, 9] in both cases? Should we add class_ids_from_zero_in_each_exp=True in the case of task_labels=True as well? (Then it works properly.)
  2. Why in dataset_benchmark are the classes [0, 1] in both cases? Here we have no idea why it is always [0, 1]. Moreover, in this case the results with/without task id are similar (see previous attachments).

I also want to ask about tensors_benchmark. In the tutorial there is this example:

    generic_scenario = tensors_benchmark(
        train_tensors=[(experience_1_x, experience_1_y), (experience_2_x, experience_2_y)],
        test_tensors=[(test_x, test_y)],
        task_labels=[0, 0],  # Task label of each train exp
        complete_test_set_only=True
    )

My understanding is that experience_1_y holds the labels for experience_1_x. So in the case of autoencoders I could pass train_tensors=[(experience_1_x, experience_1_x)] and it should be fine: the neural network would be trained to reproduce its input.
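Something along these lines (a minimal sketch; the tensor shapes and names below are placeholders):

    import torch
    from avalanche.benchmarks.generators import tensors_benchmark

    # Placeholder data: two "experiences" of flattened 28x28 inputs
    experience_1_x = torch.rand(1000, 28 * 28)
    experience_2_x = torch.rand(1000, 28 * 28)
    test_x = torch.rand(500, 28 * 28)

    # For an autoencoder, reuse the inputs as targets
    ae_scenario = tensors_benchmark(
        train_tensors=[(experience_1_x, experience_1_x),
                       (experience_2_x, experience_2_x)],
        test_tensors=[(test_x, test_x)],
        task_labels=[0, 0],
        complete_test_set_only=True,
    )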

dominik1102 avatar Jul 05 '22 15:07 dominik1102

  1. Exactly, nc_benchmark needs to be told explicitly which kind of targets you want. If you do not specify anything, it will use the entire set of targets. With class_ids_from_zero_in_each_exp=True you replicate the setup of SplitMNIST(return_task_id=True).
  2. For dataset_benchmark, the labels are taken from the AvalancheDataset. So, if you take the original MNIST train set, wrap it into an AvalancheDataset like this: AvalancheDataset(mnist_train, task_labels=0, targets_adapter=int), and then create one AvalancheSubset for each experience by providing the indices of the samples for the classes you want in each experience, you will obtain classes in [0,9] (see the sketch after this list).
  3. For tensors_benchmark, yes that is correct.
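A minimal sketch of the recipe in point 2 (assuming the AvalancheDataset/AvalancheSubset API of the Avalanche version discussed here; argument names may differ in later releases):

    from torchvision.datasets import MNIST
    from torchvision.transforms import ToTensor
    from avalanche.benchmarks.generators import dataset_benchmark
    from avalanche.benchmarks.utils import AvalancheDataset, AvalancheSubset

    mnist_train = MNIST(root='data', train=True, download=True, transform=ToTensor())
    mnist_test = MNIST(root='data', train=False, download=True, transform=ToTensor())

    # Wrap the full datasets so the original targets (0-9) are preserved
    avl_train = AvalancheDataset(mnist_train, task_labels=0)
    avl_test = AvalancheDataset(mnist_test, task_labels=0)

    # One subset per experience, two classes each (fixed class order)
    class_splits = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
    train_exps, test_exps = [], []
    for classes in class_splits:
        train_idx = [i for i, y in enumerate(mnist_train.targets.tolist()) if y in classes]
        test_idx = [i for i, y in enumerate(mnist_test.targets.tolist()) if y in classes]
        train_exps.append(AvalancheSubset(avl_train, indices=train_idx))
        test_exps.append(AvalancheSubset(avl_test, indices=test_idx))

    scenario = dataset_benchmark(train_exps, test_exps)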

AndreaCossu avatar Jul 08 '22 08:07 AndreaCossu

@AndreaCossu, @AntonioCarta, thank you for your help! It seems that we were able to learn more and get the expected results. We will probably have more questions about other things, but we will ask them on Slack :)

Thank you also for your work on the library :)

Nyderx avatar Aug 23 '22 14:08 Nyderx