
[Bug] Problem during merge to R1-V

Open · BUAADreamer opened this issue 1 week ago · 1 comment

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

When I try to add sglang to R1-V, it gets stuck while loading the offline engine:

https://github.com/BUAADreamer/R1-V/blob/main/src/r1-v/src/open_r1/trainer/grpo_trainer.py#L338
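
For reference, the engine at that line is created roughly as below (a minimal sketch of the offline Engine call; the actual trainer passes more arguments, and the model path is a placeholder). The hang happens inside the constructor, before generate is ever reached:

import sglang as sgl

# minimal sketch of the offline engine load at grpo_trainer.py#L338;
# under torchrun with 8 processes this constructor never returns for me
llm = sgl.Engine(model_path="<PATH-TO-Qwen2-VL-2B-Instruct>")  # stuck here

# never reached in my runs
out = llm.generate("Describe the image.", {"temperature": 0.0, "max_new_tokens": 32})
print(out)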

Reproduction

Install my fork of R1-V: https://github.com/BUAADreamer/R1-V

Train command:

cd src/r1-v

export DEBUG_MODE="true" # Enable debug if you want to see the model's rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir <OUTPUT_DIR> \
    --model_name_or_path <PATH-TO-Qwen2-VL-2B-Instruct> \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --deepspeed local_scripts/zero3.json \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8  \
    --sgl_model_path <PATH-TO-Qwen2-VL-2B-Instruct>

Environment

use commit: https://github.com/sgl-project/sglang/commit/522e18eaebc2f14135249bf95f24ce3505683082

Since I am using the newest transformers, I made two small modifications to sglang/srt/configs/qwen2_5_vl_config.py:

Line 51: changed to `from transformers.image_utils import is_valid_list_of_images`.
Lines 1002-1003: commented out.
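
With these two edits the module imports again; a quick sanity check I used (my own snippet, not part of the repro):

# confirms the helper exists in transformers 4.49.0.dev0 and that the
# patched sglang config module imports without errors
from transformers.image_utils import is_valid_list_of_images
import sglang.srt.configs.qwen2_5_vl_config
print("patched qwen2_5_vl_config imports fine")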

python3 -m sglang.check_env log:

2025-02-18 18:11:57,604 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-18 18:12:07 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.99
CUDA Driver Version: 525.147.05
PyTorch: 2.5.1+cu124
sglang: 0.4.3.post2
sgl_kernel: 0.0.3.post5
flashinfer: 0.2.1.post2+cu118torch2.5
triton: 3.1.0
transformers: 4.49.0.dev0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.20.0
vllm: 0.7.2
openai: 1.62.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	NODE	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	0-31,64-95	0
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NODE	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	0-31,64-95	0
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NODE	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	0-31,64-95	0
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NODE	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	0-31,64-95	0
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	32-63,96-127	1
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	32-63,96-127	1
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	32-63,96-127	1
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	32-63,96-127	1
NIC0	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS		
NIC1	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS		
NIC2	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS		
NIC3	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS		
NIC4	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS		
NIC5	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE		
NIC6	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE		
NIC7	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX		
NIC8	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


ulimit soft: 1048576

BUAADreamer · Feb 18 '25 10:02