
Distributed Data Parallel support given that DataParallel may be deprecated in the next release of PyTorch

Open yoshitomo-matsubara opened this issue 2 years ago • 9 comments

Hi @jbegaint @fracape, I'm still waiting for CompressAI's DDP support mentioned here. Could you please reconsider this option?

I think this is great timing to consider it, given that the PyTorch team is planning to deprecate DataParallel in their upcoming v1.11 release of PyTorch (see the issue below) and that many projects depend on CompressAI.

https://github.com/pytorch/pytorch/issues/65936

Feature

Distributed Data Parallel support for faster model training

Motivation

  • The PyTorch team is planning to deprecate DataParallel in their upcoming v1.11 release of PyTorch: https://github.com/pytorch/pytorch/issues/65936
  • Many projects depend on this great framework, as shown here, and faster model training with DDP support would be much appreciated by the community

Thank you!

yoshitomo-matsubara avatar Nov 16 '21 02:11 yoshitomo-matsubara

Isn't this already possible by wrapping the model via:

model = DistributedDataParallel(model)

If accessing .compress/etc is an issue, we can probably cheat and just forward those queries to the model.module instance:

from torch.nn.parallel import DistributedDataParallel


class DistributedDataParallelCompressionModel(DistributedDataParallel):
    def __getattr__(self, name):
        # Fall back to the wrapped model so methods like .compress() and
        # .update() stay reachable through the DDP wrapper.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)


model = DistributedDataParallelCompressionModel(model)
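
Usage might then look roughly like this (a sketch only; bmshj2018_hyperprior, device, local_rank, and x are placeholders for however the model, device, and input batch are set up, and the process group is assumed to already be initialized):

# Hypothetical usage sketch, assuming dist.init_process_group() has been called.
from compressai.zoo import bmshj2018_hyperprior

net = bmshj2018_hyperprior(quality=3, pretrained=True).to(device)
model = DistributedDataParallelCompressionModel(net, device_ids=[local_rank])

out = model(x)            # forward() dispatches through DDP as usual
model.update()            # forwarded to net.update() by __getattr__
enc = model.compress(x)   # forwarded to net.compress() by __getattr__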

EDIT: We released CompressAI-Trainer, which should by default use all available GPUs. (This can be restricted to fewer GPU devices, e.g. only devices 0 and 1 via export CUDA_VISIBLE_DEVICES=0,1.) Give it a try! See Installation and Walkthrough.

YodaEmbedding avatar Mar 17 '23 05:03 YodaEmbedding

@YodaEmbedding Did you confirm it worked in distributed training mode? (I do not have the resources to test it out right now.)

It is similar to the approach this repository uses for DataParallel, and I tried to use DistributedDataParallel in that way when I opened this issue. It didn't work at that time, and the problem was not that simple.

yoshitomo-matsubara avatar Mar 17 '23 06:03 yoshitomo-matsubara

Hi @YodaEmbedding @yoshitomo-matsubara ,

As suggested by @YodaEmbedding, I tried to train bmshj2018-hyperprior in the DDP setting from scratch. I used two V100-16GB GPUs with a batch size of 16 each (32 in total). I kept every other setting at its default but of course adapted examples/train.py to accommodate DDP training (pretty straightforward to do; a rough sketch is at the end of this comment). I was able to train, and my takeaways/results are as follows.

  • Generally, I have seen that the aux loss remains very high in DDP mode. For instance, in non-DDP training it was in the range of [20, 30], whereas in DDP mode it was in the range of [100, 120]. Is it because of the batch size, since the batch size in non-DDP mode was of course smaller than in DDP mode? However, I am reporting the aux loss per GPU (local rank only) in the DDP case, so I am not sure whether that is the issue. Is it a bug in the code, or is this considered normal?

  • Other losses, such as the bpp and RD loss, were fine. They were more or less the same as in the non-DDP setting.

  • Since I was using a very large batch size, the network converged at around 200 epochs (Vimeo90K dataset) and training was very fast; it took me only 4 days (and it can probably be made even faster).

  • Here are my results (Kodak dataset) for bmshj2018-hyperprior at quality level 3. I used all of the default settings and didn't change anything in the loss (e.g., lambda).


 Using trained model checkpoint_best_loss-ans
{
  "name": "bmshj2018-hyperprior-mse",
  "description": "Inference (ans)",
  "results": {
    "psnr-rgb": [
      32.051838397979736
    ],
    "ms-ssim-rgb": [
      0.9703270271420479
    ],
    "bpp": [
      0.407596164279514
    ],
    "encoding_time": [
      0.6645485758781433
    ],
    "decoding_time": [
      0.900387833515803
    ]
  }
}

  • For a fair comparison, I compared this with the pre-trained CompressAI models. I noticed that in order to reach this bpp, I had to increase the quality parameter by one (quality=3 for my DDP-trained model vs. quality=4 for the CompressAI pre-trained model).
 Using trained model bmshj2018-hyperprior-mse-4-ans
{
  "name": "bmshj2018-hyperprior-mse",
  "description": "Inference (ans)",
  "results": {
    "psnr-rgb": [
      32.826677878697716
    ],
    "ms-ssim-rgb": [
      0.9747917304436365
    ],
    "bpp": [
      0.47835625542534715
    ],
    "encoding_time": [
      0.6658469438552856
    ],
    "decoding_time": [
      0.915264755487442
    ]
  }
}
  • As you can see, the DDP results are a bit better than the pre-trained CompressAI model. Is it a bug, did the bigger batch size help, or is it something else?

Please share your ideas; I am happy to share the DDP training code as well if you would like to see it. Thanks!
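
For reference, here is a rough sketch of the kind of changes involved (simplified and illustrative rather than my exact code; the local-rank handling follows the usual torchrun / torch.distributed.launch pattern, and names like net, train_dataset, and num_epochs stand in for the corresponding objects in examples/train.py):

# Illustrative DDP sketch for examples/train.py (not the exact code used).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

# torchrun sets LOCAL_RANK per process; older torch.distributed.launch setups
# may instead pass a --local_rank argument.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

net = net.to(device)
net = DistributedDataParallel(net, device_ids=[local_rank])

# Each process gets its own shard of the dataset (replaces shuffle=True).
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_dataloader = DataLoader(
    train_dataset,
    batch_size=16,        # per-GPU batch size (32 in total on 2 GPUs)
    sampler=train_sampler,
    num_workers=4,
    pin_memory=True,
)

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shards every epoch
    # ...training loop as in examples/train.py, calling net(...) for the forward
    # pass and net.module.aux_loss() / net.module.parameters() wherever the
    # underlying CompressionModel is needed (e.g., for the aux optimizer).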

danishnazir avatar Apr 01 '23 09:04 danishnazir

Hi @danishnazir

Thank you for testing it out. Could you explain how you executed your script in distributed training mode? It should be something like torchrun --nproc_per_node=2 ... or python3 -m torch.distributed.launch --nproc_per_node=2 ...

This is the second issue I have opened for DDP support. In the first issue, there were multiple users waiting for DDP support, and at that time we could not resolve it with a simple DistributedDataParallel wrapper like the one suggested above (though I forgot to share the exact errors there).

Generally, I have seen that the aux loss remains very high in DDP mode. For instance, in non-DDP training it was in the range of [20, 30], whereas in DDP mode it was in the range of [100, 120]. Is it because of the batch size, since the batch size in non-DDP mode was of course smaller than in DDP mode? However, I am reporting the aux loss per GPU (local rank only) in the DDP case, so I am not sure whether that is the issue. Is it a bug in the code, or is this considered normal?

Probably it is because the default reduction for the MSE term is mean (https://github.com/InterDigitalInc/CompressAI/blob/53275cf5e03f83b0ab8ab01372849bfdc9ef5f1c/compressai/losses/rate_distortion.py#L47), while aux_loss takes a sum (https://github.com/InterDigitalInc/CompressAI/blob/master/compressai/models/base.py#L117-L146). That may be why only the aux loss turned out to be relatively high with a large batch size.
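
Roughly paraphrasing those two linked snippets (a simplified sketch, not the exact library source):

# Simplified paraphrase of the linked CompressAI code paths (not verbatim).
import torch.nn as nn

from compressai.entropy_models import EntropyBottleneck

# compressai/losses/rate_distortion.py: the distortion term uses nn.MSELoss(),
# whose default reduction is "mean", so it is averaged over the batch.
mse = nn.MSELoss()

# compressai/models/base.py: aux_loss() adds up the quantile losses of all
# EntropyBottleneck modules, i.e. a sum rather than a mean.
def aux_loss(model):
    return sum(m.loss() for m in model.modules() if isinstance(m, EntropyBottleneck))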

Since I was using a very large batch size, the network converged at around 200 epochs (Vimeo90K dataset) and training was very fast; it took me only 4 days (and it can probably be made even faster).

It is a little bit surprising to me that it still takes 4 days to train a model even in distributed training mode. Did it take more than 4 days with DP instead of DDP?

yoshitomo-matsubara avatar Apr 01 '23 17:04 yoshitomo-matsubara

Hi @yoshitomo-matsubara

Thank you for your response.

Could you explain how you executed your script in distributed training mode?

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --node_rank=0 examples/train.py -d /path/to/dataset/ --epochs 300 -lr 1e-4 --batch-size 16 --cuda --save

while aux_loss takes a sum. That may be why only the aux loss turned out to be relatively high with a large batch size.

Yeah, that makes sense. But I don't understand: since it is the sum per GPU in DDP mode (at least as I understand it), how can it be this high? Or am I missing something here? Moreover, does this affect performance in general?

It is a little bit surprising to me that it still takes 4 days to train a model even in distributed training mode. Did it take more than 4 days with DP instead of DDP?

You are right, it's a bit surprising to me as well that it took this long. I don't remember exactly how much time it took in DP mode; maybe I need to test that, or someone from CompressAI can confirm the time. I would also add that it didn't take exactly 4 days; that was more or less an estimate. I trained for 270 epochs, but looking at the logs it was already converging at around 200 epochs.

danishnazir avatar Apr 02 '23 10:04 danishnazir

We released CompressAI-Trainer, which should by default use all available GPUs. (This can be restricted to fewer GPU devices, e.g. only devices 0 and 1 via export CUDA_VISIBLE_DEVICES=0,1.) Give it a try! See Installation and Walkthrough.

https://interdigitalinc.github.io/CompressAI-Trainer/tutorials/full.html#single-gpu-and-multi-gpu-training

YodaEmbedding avatar Apr 02 '23 11:04 YodaEmbedding

Hi @danishnazir

Yeah, that makes sense. But I don't understand: since it is the sum per GPU in DDP mode (at least as I understand it), how can it be this high? Or am I missing something here? Moreover, does this affect performance in general?

Probably we need to see the actual code to understand it better, as the command you provided looks like the right way to use distributed training mode. If the code requires only minimal changes to use DDP, perhaps you could submit a PR (mentioning this issue) and request a code review from someone at InterDigitalInc (I am not affiliated with them).

yoshitomo-matsubara avatar Apr 03 '23 02:04 yoshitomo-matsubara

Hi @YodaEmbedding

We released CompressAI-Trainer, which should by default use all available GPUs. (This can be restricted to fewer GPU devices, e.g. only devices 0 and 1 via export CUDA_VISIBLE_DEVICES=0,1.) Give it a try! See Installation and Walkthrough.

Thank you for sharing that. From this line, I assume that the trainer supports DDP. It would be awesome if examples/train.py in this repo could support DDP as well, if @danishnazir can submit the PR.

yoshitomo-matsubara avatar Apr 03 '23 02:04 yoshitomo-matsubara

Hi @yoshitomo-matsubara, yes, I will submit the PR as soon as possible. Thanks.

danishnazir avatar Apr 03 '23 11:04 danishnazir