Post-training InternVL with an RL objective using the vLLM engine and FSDP results in "RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet."
Hello,
This may be a shot in the dark, but I am wondering if anyone has tried post-training InternVL with an RL objective, say GRPO, using the vLLM engine for the RL rollouts and FSDP plus Ray for multi-worker/multi-GPU training. The framework I am using is similar to EasyR1. The reason I ask is that I am currently running into the following error:
modules/normalization.py", line 223, in forward
return F.layer_norm(
File "/python3.10/site-packages/torch/nn/functional.py", line 2911, in layer_norm
return torch.layer_norm(
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
If you're using Caffe2, Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
(WorkerDict pid=244854) /torch/_dynamo/eval_frame.py:745: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
(WorkerDict pid=244854) return fn(*args, **kwargs)
(pid=243604) Running step 0: 0%| | 0.00/40.0 [06:56<?, ?it/s(tr_papo_ex)
(pid=244363) Train mini-batches 1: 33%|███████▎ | 1.00/3.00 [00:03<00:06, 3.38s/it]
(pid=244363) Update policy 2: 2%|▍ | 1.00/64.0 [00:03<03:33, 3.38s/it]
Specifically, when I try to access the weights and biases of the layer-norm layer under the debugger, I get:
*** RuntimeError: setStorage: sizes [3], strides [1], storage offset 291409920, and itemsize 2 requiring a storage size of 582819846 are out of bounds for storage of size 0
(WorkerDict pid=1442772) (Pdb)
This indicates that the weights and biases have already been freed.
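For reference, this is roughly how I am inspecting the parameters (a minimal sketch; `fsdp_module` stands for the FSDP-wrapped model and the module path is illustrative, not my exact code). Outside of `summon_full_params`, FSDP only keeps the local flat-parameter shard, which is why I suspect the per-module views can end up with a storage size of 0:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Gather the full (unsharded) parameters before touching them; outside this
# context FSDP only holds the local flat-parameter shard.
with FSDP.summon_full_params(fsdp_module, writeback=False):
    # Module path is illustrative; adjust to the actual InternVL layer norm being inspected.
    ln = fsdp_module.vision_model.encoder.layers[0].norm1
    print(ln.weight.shape, ln.bias.shape)
```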
Now, I know that vLLM uses its own InternVLProcessor for prompt updates and InternVLMultimodalProcessor for processing multimodal prompts, so I tried also using the InternVLProcessor from vLLM to process my prompts before passing them into the vLLM engine, but the error persists. I have also tried the InternVLProcessor from transformers (version 4.52.2), but the issue remains. I have also tried deactivating gradient checkpointing.
I am new to vLLM and FSDP, so I am unsure what the issue could be or how it could be fixed. I am using vllm 0.8.4 and torch 2.6.
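For context, this is roughly how the rollouts hand prompts to the vLLM engine (a minimal sketch; the model id, chat template, and image are placeholders rather than my actual EasyR1 config):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Placeholders: model id, prompt template, and image path are illustrative only.
llm = LLM(model="OpenGVLab/InternVL2-2B", trust_remote_code=True,
          limit_mm_per_prompt={"image": 1})
image = Image.open("example.jpg")
prompt = "<|im_start|>user\n<image>\nDescribe the picture.<|im_end|>\n<|im_start|>assistant\n"

# The <image> placeholder in the prompt is expanded internally by vLLM's
# InternVL multimodal processor during rollout.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```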
Please try our codebase. We have released the codebase, training scripts, a conda-packed environment, and the training data.
Thank you for the response! I will take a look. EasyR1 is implemented against the Hugging Face models rather than the original repos directly.
Can I ask a specific implementation question? I notice that the token replacement is <image> -> <img><IMG_CONTEXT>*X</img>. Why do you make a distinction between the image embedding token and the image token? I ask because transformers implements the replacement as <IMG_CONTEXT> -> <IMG_CONTEXT>*X by updating the chat template to convert the image token.
This type of replacement seems to be what is causing the issue I am running into, so I would like to better understand it in the hope of finding a solution.
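To make sure I am describing the two conventions correctly, here is a minimal sketch of what I mean (token strings as above; the expansion count is illustrative and in practice depends on the image tiling):

```python
IMG_START, IMG_END, IMG_CONTEXT = "<img>", "</img>", "<IMG_CONTEXT>"
num_image_tokens = 256  # illustrative; depends on the number of image patches/tiles

def expand_vllm_style(prompt: str) -> str:
    # vLLM-style: the generic <image> placeholder is replaced with the full
    # <img><IMG_CONTEXT>*X</img> block expected by the language model.
    return prompt.replace("<image>", IMG_START + IMG_CONTEXT * num_image_tokens + IMG_END)

def expand_transformers_style(prompt: str) -> str:
    # transformers-style: the chat template already emits <IMG_CONTEXT>, and the
    # processor expands that single token X times in place.
    return prompt.replace(IMG_CONTEXT, IMG_CONTEXT * num_image_tokens)
```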
I get the same problem and don't know how to solve it. I tried freezing the InternVL vision tower, as the repo does at "https://github.com/Qsingle/verl/blob/e057c06067090d1ad058ccbee47c961a79a26453/verl/workers/fsdp_workers.py#L3", line 300.
After freezing the vision tower, the error disappears, but I still don't know why.
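For anyone who wants to try the same workaround, this is a minimal sketch of what the freezing looks like before the FSDP wrap (the attribute names and the `use_orig_params` flag are assumptions on my part, not the exact code in that file):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def freeze_vision_tower(model: torch.nn.Module) -> None:
    # Attribute name is an assumption; depending on the checkpoint/wrapper the
    # ViT may be exposed as `vision_model` or `vision_tower`.
    vision_tower = getattr(model, "vision_model", None) or getattr(model, "vision_tower", None)
    if vision_tower is None:
        return
    for p in vision_tower.parameters():
        p.requires_grad_(False)

# Freeze before wrapping, so frozen and trainable parameters do not end up in
# the same FSDP flat-parameter group:
#   freeze_vision_tower(model)
#   fsdp_model = FSDP(model, use_orig_params=True)
```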
Fascinating, thank you for sharing, @ZZYuting. I will see if there is a way to do that with EasyR1. I will let you know if it works and whether I can figure out why it works.
Unfortunately, this does not seem to fix the issue with EasyR1. It may be because the verl version Qsingle used is newer than the one EasyR1 is based on: EasyR1 uses a Feb 21, 2025 version, while Qsingle's fork is based on a commit from around Jul 2.