Yunqi Yan comments

Results: 3 comments by Yunqi Yan

Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3

Same here. I used a quad-RTX 4090 setup (~96GB VRAM) for testing, but it still ran into OOM.

Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3

> I was able to run the code successfully on a machine with 4xRTX3090 (96GB of VRAM in total) by setting "train_batch_size" and "validation_batch_size" both to 1 in "train.py". (As suggested,...

Bug! Help! MS-SWIFT GRPO + LoRA training hung/stuck after training 1 step from full merged model merged from lora adapter

> pip install py-spy
> py-spy dump --pid

In my case, the py-spy result is:

```
Process 3250260: /home/user/miniconda3/envs/swift/bin/python3.11 -u /home/user/Desktop/GRPO/grpo_swift/ms-swift/swift/cli/rlhf.py --rlhf_type grpo --model /home/user/Desktop/GRPO/grpo_swift/output/sft/v4-20250530-192816/checkpoint-2319-merged --reward_funcs external_r1v_acc format --reward_weights 1...
```