小臣子吃大橙子

Results: 9 comments of 小臣子吃大橙子

Well, I think I solved this by adding `torch.cuda.empty_cache()` at the end of each iteration.

Original

```python
if dist.is_primary():
    lr = optimizer.param_groups[0]["lr"]
    # ...
    if i % 100 == 0:
        model.eval()
        # ...
        with torch.no_grad():
            out, _ = model(sample)
            # ...
        model.train()
```

Fixed ```...
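Since the "Fixed" snippet is cut off above, here is a minimal, self-contained sketch of what the adjusted loop might look like, assuming the only change is adding `torch.cuda.empty_cache()` at the end of each iteration as described in the first comment. The toy model, loader, and optimizer are stand-ins for the real training code, and the `dist.is_primary()` guard is omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs on its own; in the real code `model`,
# `loader`, `optimizer`, and `sample` come from the training script.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        out = self.fc(x)
        return out, None  # mimic the (out, _) return in the snippet above

model = ToyModel()
loader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(5)]
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
sample = torch.randn(4, 8)

for i, (img, label) in enumerate(loader):
    model.train()
    model.zero_grad()
    out, _ = model(img)
    criterion(out, label).backward()
    optimizer.step()

    # Periodic sanity-check inference, as in the original snippet.
    if i % 100 == 0:
        model.eval()
        with torch.no_grad():
            out, _ = model(sample)
        model.train()

    # The workaround described above: release cached CUDA memory at the
    # end of every iteration so the periodic eval does not accumulate it.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Note that `empty_cache()` only returns memory held by PyTorch's caching allocator to the driver; it does not free memory still referenced by live tensors, so it mainly helps when the periodic eval allocates transient buffers.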

perhaps you can try moving `model.train()` to the beginning of each iter

```python
for i, (img, label) in enumerate(loader):
    model.train()
    model.zero_grad()
```

I don't see anything else different from my...

```
python 3.9.13
torch 1.11.0+cu113
torchvision 0.12.0+cu113
```

```
python 3.9.13
torch 1.12.0+cu113
torchaudio 0.12.0+cu113
torchvision 0.13.0+cu113
```

I've tested on two environments with different pytorch versions with CUDA 11.2. I...

I encountered the same problem when training on a **single 8*V100 node** with our own datasets. The code and environment are unchanged and had worked fine until yesterday. However, this problem...

Well, I ran it on another node for a whole day, and it was just fine. So I think the node is to blame. Rebooting the node might work, but I...

Hey guys, are there any updates for this error?

> However, when image-text pair data and text-only data were included in the same batch, the following error occurred when running...

Same issue when using sbatch, but fine with salloc 🤔 Have you solved this problem?

```
salloc --partition=gpu_llm --gres=gpu:8 --time=3-00:00:00 --mem=1000G --cpus-per-task=8 /bin/bash
singularity exec --nv /public/home/.../verl_llamafactory_20250722.sif bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```

✅

```
#!/bin/bash
#SBATCH --job-name=test_grpo
#SBATCH --partition=gpu_llm
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=8
#SBATCH --time=3-00:00:00
#SBATCH --ntasks-per-node=1
...
```