
Post-training InternVL with an RL objective using the vllm engine and FSDP results in RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.

Open SStoica12 opened this issue 3 months ago • 6 comments

Hello,

This may be a shot in the dark, but I am wondering if anyone has tried post-training InternVL with an RL objective, say GRPO, using the vllm engine for the RL rollouts and FSDP plus Ray for multi-worker/multi-GPU training. The framework I am using is similar to EasyR1. The reason I ask is that I am currently running into the following error:

modules/normalization.py", line 223, in forward
    return F.layer_norm(
  File "/python3.10/site-packages/torch/nn/functional.py", line 2911, in layer_norm
    return torch.layer_norm(
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
If you're using Caffe2, Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
(WorkerDict pid=244854) /torch/_dynamo/eval_frame.py:745: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
(WorkerDict pid=244854)   return fn(*args, **kwargs)                                             
(pid=243604) Running step 0:   0%|                                   | 0.00/40.0 [06:56<?, ?it/s(tr_papo_ex)              
(pid=244363) Train mini-batches 1:  33%|███████▎              | 1.00/3.00 [00:03<00:06, 3.38s/it]
(pid=244363) Update policy 2:   2%|▍                          | 1.00/64.0 [00:03<03:33, 3.38s/it]

Specifically, when I try to access the weights and biases of the layer-norm layer in pdb, I get: *** RuntimeError: setStorage: sizes [3], strides [1], storage offset 291409920, and itemsize 2 requiring a storage size of 582819846 are out of bounds for storage of size 0

This indicates that the weight and bias have already been freed.
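For reference, a rough sketch of how I can inspect that parameter with the full weights gathered (the submodule path is only an illustrative example from my model; outside this context FSDP may reshard and free the storage again):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def inspect_layer_norm(fsdp_model, name="vision_model.encoder.layers.0.norm1"):
    # Gather the full (unsharded) parameters before touching them; outside this
    # context the flat-parameter storage can be freed on non-owning ranks,
    # which matches the size-0 storage seen in pdb above.
    with FSDP.summon_full_params(fsdp_model, writeback=False):
        ln = fsdp_model.get_submodule(name)
        print(ln.weight.shape, ln.weight.untyped_storage().size())
```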

Now, I know that vllm uses its own InternVLProcessor for prompt updates and InternVLMultiModalProcessor for processing multimodal prompts, so I tried using the InternVLProcessor from vllm to process my prompts before passing them into the vllm engine, but the error persists. I have also tried the InternVLProcessor from transformers (version 4.52.2), but the issue remains. I have also tried disabling gradient checkpointing.
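For context, the rollout call I have in mind looks roughly like the sketch below (model id, prompt, image path, and sampling parameters are placeholders, not the exact EasyR1 code); vllm's own processor then expands the <image> placeholder internally:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Placeholder model id and inputs; trust_remote_code is needed for InternVL.
llm = LLM(model="OpenGVLab/InternVL2-8B", trust_remote_code=True)
image = Image.open("example.jpg")

outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe the image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=1.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```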

I am new to vllm and FSDP, so I am unsure what the issue could be or how to fix it. I am using vllm 0.8.4 and torch 2.6.

SStoica12 avatar Sep 08 '25 06:09 SStoica12

Please try our codebase. We have released the codebase, training scripts, a conda-packed environment, and the training data.

Weiyun1025 avatar Sep 08 '25 12:09 Weiyun1025

Thank you for the response! I will take a look. Note that EasyR1 is implemented against the HF implementations rather than the model repos directly.

Can I ask a specific implementation question? I notice that your token replacement is <image> -> <img><IMG_CONTEXT>*X</img>. Can I ask why you make a distinction between the image embedding token and the image token? The reason I ask is that transformers implements the replacement as <IMG_CONTEXT> -> <IMG_CONTEXT>*X, by updating the chat template to convert the image token to <IMG_CONTEXT>. I know that through the image flags you can control what is treated as an embedding and what is treated as padding, which lets the under-the-hood token replacement work without issues once the PIL images get tokenized and need to be merged with the text.

This type of replacement seems to be what is causing the issue I am running into, so I would like to better understand it in the hope of finding a solution. An illustrative-only comparison of the two conventions, as I understand them, is sketched below.
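(num_image_token here is just an example value; the real count depends on image tiling.)

```python
IMG_START, IMG_END, IMG_CONTEXT = "<img>", "</img>", "<IMG_CONTEXT>"
num_image_token = 256  # example value only

prompt = "<image>\nDescribe the image."

# InternVL-repo style: wrap the repeated context tokens in explicit <img>...</img> markers.
internvl_style = prompt.replace(
    "<image>", IMG_START + IMG_CONTEXT * num_image_token + IMG_END
)

# transformers-processor style: the chat template first maps <image> to <IMG_CONTEXT>,
# then the processor expands that single token in place, with no wrapper tokens.
hf_style = prompt.replace("<image>", IMG_CONTEXT).replace(
    IMG_CONTEXT, IMG_CONTEXT * num_image_token
)
```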

SStoica12 avatar Sep 11 '25 07:09 SStoica12

I get the same problem and don't know how to solve it. I'm trying to freeze the InternVL vision tower, as done in this repo: https://github.com/Qsingle/verl/blob/e057c06067090d1ad058ccbee47c961a79a26453/verl/workers/fsdp_workers.py#L3 (around line 300).

ZZYuting avatar Sep 12 '25 09:09 ZZYuting

After freezing the vision tower, the error disappears. But I still don't know why.
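For anyone else hitting this, a minimal sketch of what freezing the vision tower before FSDP wrapping looks like (the submodule name `vision_model` is an assumption based on the InternVL checkpoints; adjust for your model):

```python
def freeze_vision_tower(model):
    # Assumes the vision encoder lives under `model.vision_model`; adjust if needed.
    vision_tower = getattr(model, "vision_model", None)
    if vision_tower is None:
        return
    for param in vision_tower.parameters():
        param.requires_grad_(False)
    vision_tower.eval()  # also disable dropout etc. in the frozen part
```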

ZZYuting avatar Sep 12 '25 09:09 ZZYuting

Fascinating, thank you for sharing, @ZZYuting. I will see if there is a way to do that with EasyR1. I will let you know if it works, and whether I can figure out why it works.

SStoica12 avatar Sep 12 '25 18:09 SStoica12

Unfortunately, that does not seem to fix the issue with EasyR1. It may be because the verl version Qsingle used is newer than the one EasyR1 is based on: EasyR1 uses a version from Feb 21, 2025, while Qsingle used a commit from around Jul 2.

SStoica12 avatar Sep 12 '25 21:09 SStoica12