
[Question] Does sequence parallelism effectively reduce GPU memory in forward pass?

Open yechenzhi opened this issue 10 months ago • 1 comment

With max_response_length >= 16k and use_dynamic_bsz = False, I'm hitting OOM errors even with ulysses_sequence_parallel_size >= 2.

for epoch in range(self.config.ppo_epochs):
    for batch_idx, data in enumerate(dataloader):
        # split batch into micro_batches
        mini_batch = data
        if self.config.use_dynamic_bsz:
            max_token_len = self.config.ppo_max_token_len_per_gpu * self.ulysses_sequence_parallel_size
            micro_batches, _ = rearrange_micro_batches(batch=mini_batch, max_token_len=max_token_len)
        else:
            self.gradient_accumulation = self.config.ppo_mini_batch_size // self.config.ppo_micro_batch_size_per_gpu
            # split batch into micro_batches
            micro_batches = mini_batch.split(self.config.ppo_micro_batch_size_per_gpu)

        self.actor_optimizer.zero_grad()

        for data in micro_batches:
            data = data.cuda()  # actor device is cpu when using offload
            responses = data['responses']
            response_length = responses.size(1)
            attention_mask = data['attention_mask']
            response_mask = attention_mask[:, -response_length:]
            old_log_prob = data['old_log_probs']
            advantages = data['advantages']

            clip_ratio = self.config.clip_ratio
            entropy_coeff = self.config.entropy_coeff

            # all return: (bsz, response_length)
            entropy, log_prob = self._forward_micro_batch(micro_batch=data, temperature=temperature)

From the code above, I notice that data.cuda() is called before _forward_micro_batch(), which is where sequence parallelism is applied.

Question: does this mean the entire sample is loaded into GPU memory before being split for parallel processing? Doesn't that defeat the memory-reduction benefit of sequence parallelism? Should the CUDA transfer happen after the sequence is split?
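
To make the concern concrete, here is a minimal sketch of the kind of slicing that happens inside _forward_micro_batch under Ulysses-style sequence parallelism. This is not verl's actual implementation; the function name and the sp_rank/sp_size parameters are illustrative:

import torch

def slice_inputs_for_sp(input_ids: torch.Tensor, sp_rank: int, sp_size: int) -> torch.Tensor:
    # Each rank keeps only its 1/sp_size slice of the sequence dimension,
    # so the activations computed downstream shrink by a factor of sp_size.
    seq_len = input_ids.size(1)
    assert seq_len % sp_size == 0, "pad the sequence so it divides evenly across ranks"
    shard = seq_len // sp_size
    return input_ids[:, sp_rank * shard : (sp_rank + 1) * shard]

# The full (bsz, seq_len) batch is still moved to GPU by data.cuda() before
# any slicing happens; as the reply below notes, that tensor is tiny compared
# with the activations the forward pass creates.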

yechenzhi avatar Mar 03 '25 16:03 yechenzhi

The data size is relatively small compared to the activations/hidden states produced during computation, so data.cuda() is usually not a big deal in RL training.

Simple arithmetic for input_ids at a large micro-batch size: micro_bsz * seq_len * sizeof(int64) = 256 * 16K * 8 bytes = 32MB.
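
For concreteness, the same arithmetic in a few lines of Python (plain estimates, not profiler numbers; hidden_size = 4096 is an illustrative model width, not taken from the thread):

micro_bsz, seq_len = 256, 16 * 1024
input_ids_mb = micro_bsz * seq_len * 8 / 2**20           # int64 = 8 bytes/token
print(f"input_ids on GPU: {input_ids_mb:.0f} MB")        # -> 32 MB

hidden_size = 4096                                       # illustrative value
one_hidden_gb = micro_bsz * seq_len * hidden_size * 2 / 2**30  # one bf16 tensor
print(f"single bf16 hidden-state tensor: {one_hidden_gb:.0f} GB")  # -> 32 GB

A single hidden-state tensor at this shape is already three orders of magnitude larger than the input_ids transfer, which is why sequence parallelism targets activations rather than the input batch.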

PeterSH6 avatar Mar 03 '25 16:03 PeterSH6

I'd like to ask: if I want to increase max_response_length with limited resources (8 × 80G GPUs), which parameters can I adjust to avoid OOM? I've already enabled offload and gradient checkpointing, and set the dynamic-batching token budget to twice the sum of the prompt and response lengths. If parameter tuning isn't enough, can I meet the requirement by adding more nodes?
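
For reference, a hedged sketch of the knobs that usually matter here, written as a Python dict of hydra-style override keys. The key names follow verl's ppo_trainer.yaml conventions as I understand them and the values are purely illustrative; verify both against your verl version:

overrides = {
    # shard activations across 4 GPUs along the sequence dimension
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 4,
    # dynamic batching: cap tokens per GPU per micro-batch instead of a fixed bsz
    "actor_rollout_ref.actor.use_dynamic_bsz": True,
    "actor_rollout_ref.actor.ppo_max_token_len_per_gpu": 32768,
    # recompute activations in the backward pass instead of storing them
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    # keep parameters / optimizer state on CPU between micro-batches
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}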

It seems that Megatron already provides support for this (https://github.com/volcengine/verl/pull/495)? Could that be a solution?

nomadlx avatar Mar 10 '25 09:03 nomadlx

Full Megatron integration is in progress; it takes some time to adopt transformer_engine's support for the various models.

ETOgaosion avatar Mar 12 '25 06:03 ETOgaosion

> Full Megatron integration is in progress; it takes some time to adopt transformer_engine's support for the various models.

Could you provide a separate section in the documentation describing the Megatron integration progress? Currently I have no way of knowing whether I can enable certain features with a given Megatron version, or which training strategies are currently supported.

nomadlx avatar Mar 13 '25 03:03 nomadlx

Once features are supported, we will document them at https://verl.readthedocs.io/en/latest/workers/megatron_workers.html; the additional Megatron options in ppo_megatron_trainer.yml will be synced to https://verl.readthedocs.io/en/latest/examples/config.html

ETOgaosion avatar Mar 13 '25 06:03 ETOgaosion