
Support distributed training with torch.nn.DataParallel()

Open lxqpku opened this issue 2 years ago • 5 comments

🐝 Expected behavior

Support multiple GPU training.

How can I set multiple GPUs as the device and train the model on multiple GPUs?

lxqpku avatar Dec 13 '21 12:12 lxqpku

@lrzpellegrini Can you please tell me where exactly the changes (and tests) must be done? This issue seems too broad to me.

Thanks!

ashok-arjun avatar Dec 29 '21 04:12 ashok-arjun

Hi @ashok-arjun, I'm already working on this.

It's a very broad issue and it requires a lot of changes in different modules. I'm implementing this along with the checkpointing functionality.

lrzpellegrini avatar Dec 29 '21 22:12 lrzpellegrini

I'm pasting some notes that I took during our last meeting. Keep in mind that I have very limited experience with distributed training.

Dataloading

Distributed training requires using a DistributedSampler (i.e. splitting samples among workers). Avalanche plugins may need to modify the data loading. How do we ensure they do not break with distributed training? Do we need to provide some guidelines? This is also a more general design question, since I think that Online CL will require particular care with data loading (which we are ignoring right now at a great performance cost).
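
For reference, a minimal sketch of the data-loading side in plain PyTorch (not Avalanche-specific; `train_dataset` and `num_epochs` are placeholders for whatever the current experience provides):

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumes torch.distributed has already been initialized (init_process_group)
# and that each process corresponds to one worker/GPU.
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    # Needed so that the shuffling order differs across epochs.
    sampler.set_epoch(epoch)
    for batch in loader:
        ...  # forward/backward on the local shard only
```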

Add Lazy Metrics?

Ideally, we should not synchronize the data at every iteration. This is a problem because metrics are evaluated on the global values after each iteration. I think we can avoid this with some changes to metrics:

  • by default, metrics are computed on global values, which means they are continuously synchronized (expensive).
  • we could add an additional operation to metrics (metrics, not metric plugins), merge, which combines two metric instances. Not every metric needs this operation, but those that implement it can safely be computed on the local state and then merged when we want to synchronize the data (see the sketch below). Ideally, the EvaluationPlugin should deal with the distributed code, while metrics could safely ignore it.
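
A rough sketch of what such a merge operation could look like for an accuracy-style metric (the class and method names here are hypothetical, not existing Avalanche API):

```python
class LazyAccuracy:
    """Accumulates correct/total locally; synchronization is deferred to merge()."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predicted, targets):
        # Purely local update: no cross-worker communication here.
        self.correct += (predicted == targets).sum().item()
        self.total += targets.numel()

    def merge(self, other: "LazyAccuracy") -> "LazyAccuracy":
        # Combine the local states of two workers into a global metric.
        merged = LazyAccuracy()
        merged.correct = self.correct + other.correct
        merged.total = self.total + other.total
        return merged

    def result(self):
        return self.correct / self.total if self.total > 0 else 0.0
```

The EvaluationPlugin could then gather the per-worker metric objects (e.g. with torch.distributed.all_gather_object) and reduce them with merge only when a global value is actually requested.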

Tests

We should write a quick test checking that plugins have exactly the same behavior on some small benchmark (distributed vs. local), e.g. 2 epochs and 3 experiences on a toy benchmark, verifying that the learning curves and final parameters are exactly the same.
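
Something along these lines, assuming the two runs save their final model state_dicts to disk and match up to numerical tolerance (the helper name and paths are hypothetical):

```python
import torch

def assert_same_training_outcome(local_ckpt, distributed_ckpt, atol=1e-6):
    """Check that a local run and a distributed run end with the same parameters."""
    local_state = torch.load(local_ckpt, map_location="cpu")
    dist_state = torch.load(distributed_ckpt, map_location="cpu")
    assert local_state.keys() == dist_state.keys()
    for name, local_param in local_state.items():
        assert torch.allclose(local_param, dist_state[name], atol=atol), \
            f"Parameter {name} diverged between local and distributed runs"
```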

Plugins behavior

The big question is: what is the default behavior when a plugin tries to access an attribute that is distributed among workers? Do we synchronize and return the global version, or do we return the local one? Examples:

  • In EWC, if we execute on each worker, we multiply the loss by the number of workers (and we don’t want this).
  • In GEM, we project the (global) gradient. How does it work in a distributed setting?
  • In LwF, we want the distillation on the local inputs. Otherwise, we have a synchronization and we also sum the loss. So it seems to me that we have plugins that break in either case (local or global default); tests can help here to quickly identify which plugins are broken. A minimal illustration of the local vs. global distinction follows this list.
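
To make the local vs. global distinction concrete, this is roughly the difference in plain torch.distributed terms (a sketch, not Avalanche code):

```python
import torch
import torch.distributed as dist

def global_loss(local_loss: torch.Tensor) -> torch.Tensor:
    # "Global" view: all-reduce sums the per-worker losses, so a plugin that
    # adds a penalty on each worker (the EWC case above) effectively sees it
    # multiplied by the number of workers unless we divide by world_size.
    synced = local_loss.clone()
    dist.all_reduce(synced, op=dist.ReduceOp.SUM)
    return synced / dist.get_world_size()

# "Local" view: use local_loss as-is, with no communication; this is what an
# LwF-style distillation on the local inputs would expect.
```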

AntonioCarta avatar Feb 11 '22 14:02 AntonioCarta

Is there any update on this functionality and where it is on the timeline for a next Avalanche version?

mattdl-meta avatar Jun 30 '22 21:06 mattdl-meta

We have an open PR #996, but we still need to test it more in depth. It should be ready for the next release, but it's a big and complex feature, so many things may still go wrong. If you want to use Avalanche with distributed training for the moment, I would suggest defining your own distributed training loop (it doesn't have to be an Avalanche strategy), where you call the benchmarks, models, and training plugins yourself.
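
For completeness, a rough sketch of the kind of custom loop I mean, reusing Avalanche benchmarks and models but handling the DistributedDataParallel part yourself (assumes the classic SplitMNIST benchmark and SimpleMLP model as stand-ins; adapt to your setup):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP

def main(rank: int, world_size: int):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    benchmark = SplitMNIST(n_experiences=5)
    model = DDP(SimpleMLP(num_classes=10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for experience in benchmark.train_stream:
        sampler = DistributedSampler(experience.dataset)
        loader = DataLoader(experience.dataset, batch_size=128, sampler=sampler)
        for epoch in range(2):
            sampler.set_epoch(epoch)
            for batch in loader:
                # Avalanche minibatches are (x, y, task_id) tuples.
                x, y = batch[0].cuda(rank), batch[1].cuda(rank)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()

    dist.destroy_process_group()
```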

AntonioCarta avatar Jul 04 '22 08:07 AntonioCarta