
[REQUEST] torch equivalent api model.no_sync()

Open tangzhy opened this issue 3 years ago • 7 comments

Hi, I'm adding DeepSpeed support to my distributed model training framework.

When using PyTorch's native APIs, everything is fine. For distributed training, I would normally wrap the model in nn.parallel.DistributedDataParallel and use the model.no_sync() API to avoid unnecessary gradient sync ops.

I cannot find an equivalent API in DeepSpeed. Can you offer some help?

tangzhy avatar Apr 20 '22 01:04 tangzhy

Out of curiosity, what sort of things do you want to do when sync is turned off? Is it gradient accumulation? If so, the DeepSpeed engine will make sure not to sync/communicate in between gradient accumulation boundaries.

jeffra avatar Apr 20 '22 02:04 jeffra

@jeffra In contrastive learning, which is very popular in the research community right now, we usually rely on a large batch size to compute the NCE softmax loss.

Even with the remarkable GPU RAM savings from DeepSpeed, I still have to figure out how to scale the batch size from a small number like 32 to something very large like 2048, or even more.

To do this, I have to break the inputs into chunks and compute the gradients chunk by chunk, from the final logits back through the backbone model.

With PyTorch's model.no_sync, I can accumulate the gradients locally and only perform the global sync on the last chunk.
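
Roughly, the pattern with native PyTorch looks like the sketch below, where ddp_model is the DistributedDataParallel-wrapped model and chunks / compute_chunk_loss / optimizer are placeholders for my actual chunked inputs, per-chunk loss, and optimizer:

    from contextlib import nullcontext

    for i, chunk in enumerate(chunks):
        is_last_chunk = (i + 1 == len(chunks))
        # Suppress the DDP gradient all-reduce for every chunk except the last;
        # the final forward/backward outside no_sync() triggers one global sync.
        with nullcontext() if is_last_chunk else ddp_model.no_sync():
            compute_chunk_loss(ddp_model, chunk).backward()
    optimizer.step()
    optimizer.zero_grad()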

More generally, I think DeepSpeed should consider exposing this flexibility to users, since this scenario is common for very large models, which are also DeepSpeed's main target.

tangzhy avatar Apr 20 '22 02:04 tangzhy

Can you show a small snippet of code for how you're doing this in native PyTorch? This sounds very much like gradient accumulation.

Does gradient_accumulation_steps do what you want? See: https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters

Having DeepSpeed support a similar PyTorch-style no_sync sounds beneficial. But I'm curious whether our existing method works for you for now?
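
For reference, a rough sketch of how that is usually wired up through the engine (net and train_set are placeholders here, and the config values are only examples):

    import deepspeed

    ds_config = {
        # effective batch = micro_batch_per_gpu * gradient_accumulation_steps * world_size
        "train_micro_batch_size_per_gpu": 32,
        "gradient_accumulation_steps": 64,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    }

    model_engine, optimizer, train_loader, _ = deepspeed.initialize(
        model=net, model_parameters=net.parameters(),
        training_data=train_set, config=ds_config)

    for step, batch in enumerate(train_loader):
        loss = model_engine(batch)     # assumes the model's forward returns the loss
        model_engine.backward(loss)    # gradients are only all-reduced at accumulation boundaries
        model_engine.step()            # optimizer/LR step only fires at accumulation boundaries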

jeffra avatar Apr 20 '22 02:04 jeffra

I'm sure that gradient accumulation cannot support my purpose, because gradient accumulation only accumulates the gradients at the training-example level.

For the contrastive NCE softmax loss, I have to break the inputs into chunks and accumulate the gradients at the chunk level.

Here's the code:

        # Construct the gradient cache
        chunked_inputs = self.split_tensor_dict(inputs)
        for c in chunked_inputs:
            c['output_hidden_states'] = True
        cls_hiddens, rnd_states = self.gc.forward_no_grad(self.model.lm, chunked_inputs)
        if self.args.local_rank > -1:
            cls_hiddens = self.gather_tensors(cls_hiddens.contiguous())[0]
        grad_cache, total_loss = self.gc.build_cache(cls_hiddens)
        grad_cache = grad_cache[0]
        if self.args.local_rank > -1:
            total_loss = total_loss / dist.get_world_size()

        inputs['labels'] = labels
        chunked_inputs = self.split_tensor_dict(inputs)

        # Compute the full loss with cached gradients
        for local_chunk_id, chunk in enumerate(chunked_inputs):
            device_offset = max(0, self.args.local_rank) * self.args.per_device_train_batch_size * 2
            local_offset = local_chunk_id * self.args.cache_chunk_size
            chunk_offset = device_offset + local_offset
            with rnd_states[local_chunk_id]:
                if self.use_amp:
                    with autocast():
                        lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)
                else:
                    lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)

            if self.args.gradient_accumulation_steps > 1:
                raise ValueError

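            # Skip the DDP gradient all-reduce for every chunk except the last one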
            ddp_no_sync = self.args.local_rank > -1 and (local_chunk_id + 1 < len(chunked_inputs))
            with model.no_sync() if ddp_no_sync else nullcontext():
                if self.use_amp:
                    (self.scaler.scale(lm_loss) + surrogate).backward()
                elif self.use_apex:
                    raise ValueError
                elif self.deepspeed:
                    raise ValueError
                else:
                    (lm_loss + surrogate).backward()
            total_loss += lm_loss

Without model.no_sync, the code syncs gradients for every chunk, which dramatically increases the backward time. I only want to sync the gradients once, when hitting the last chunk :)

tangzhy avatar Apr 20 '22 03:04 tangzhy

@jeffra I wonder if you have a plan to add this feature? If so, is there an expected timeline?

tangzhy avatar Apr 21 '22 11:04 tangzhy

To provide some context and an overview, @tangzhy is referring to the gradient caching technique, implemented here: https://github.com/luyug/GradCache (the link to the paper is in the README). You can basically see it as "gradient accumulation for contrastive learning". The reason why vanilla gradient accumulation cannot be used directly is that (e.g. on 1 GPU) computing the contrastive loss requires all samples across the batch (to be used as negatives), while gradient accumulation only lets us use the subset of samples inside the micro-batch. In the case of distributed training, we'd like to use all samples across all batches on all GPUs, in which case model.no_sync would be useful during the backward pass (the code and paper I linked will make this clear). I guess the question is: does a model wrapped by DeepSpeed sync gradients by default when executing .backward(), and if so, is there a way to prevent this? Thank you!
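
For readers who land here later, a rough single-GPU sketch of the two-stage idea (not the GradCache implementation itself; the encoder, pairing scheme, and temperature are placeholders):

    import torch
    import torch.nn.functional as F

    def grad_cache_step(encoder, batch, chunk_size, temperature=0.05):
        # Toy assumption: batch holds consecutive positive pairs (0,1), (2,3), ...
        chunks = batch.split(chunk_size)

        # Stage 1: representation-only forward, no graph kept, so memory stays bounded.
        with torch.no_grad():
            reps = torch.cat([encoder(c) for c in chunks])

        # Stage 2: contrastive loss over the FULL batch; cache d(loss)/d(reps).
        reps = reps.detach().requires_grad_()
        sim = reps @ reps.T / temperature
        sim = sim.masked_fill(torch.eye(reps.size(0), dtype=torch.bool, device=reps.device), float("-inf"))
        labels = torch.arange(reps.size(0), device=reps.device) ^ 1  # pair each even index with the next odd one
        loss = F.cross_entropy(sim, labels)
        loss.backward()
        grad_cache = reps.grad.split(chunk_size)

        # Stage 3: re-run each chunk with grad enabled and push the cached gradients
        # into the encoder via a surrogate dot product; under DDP this is the loop
        # where no_sync() would wrap every chunk except the last one.
        for chunk, cached_grad in zip(chunks, grad_cache):
            surrogate = (encoder(chunk) * cached_grad).sum()
            surrogate.backward()

        return loss.detach()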

gzerveas avatar Aug 17 '22 00:08 gzerveas

@gzerveas Yes, thanks for the clarification! @jeffra DeepSpeed is critical for us to employ billion-scale models in contrastive learning. Looking forward to your thoughts :)

tangzhy avatar Aug 22 '22 03:08 tangzhy

@tangzhy have you figured out how to use DeepSpeed for GradCache? Thanks!

memray avatar Sep 08 '23 04:09 memray