Ang

Results: 3 issues of Ang

Training on a 24 GB 3090: single-GPU LoRA training uses about 13 GB of GPU memory, but multi-GPU training fails with `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`. Monitoring the GPU, memory keeps climbing to the full 24 GB before the error appears. Could this be running out of GPU memory? The launch script is:

```
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 src/train_sft.py \
    --model_name_or_path /wang/wangmodels/chatglm2-6b \
    --use_v2 \
    --do_train...
```
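`CUBLAS_STATUS_NOT_INITIALIZED` raised from `cublasCreate` is often a downstream symptom of the GPU already being out of memory when cuBLAS tries to allocate its workspace. A rough back-of-the-envelope estimate of the static footprint can help sanity-check this; the parameter counts and byte costs below are illustrative assumptions, not measured values for ChatGLM2-6B:

```python
def estimate_lora_memory_gb(base_params=6.2e9, lora_params=4e6,
                            bytes_per_weight=2, optimizer_bytes_per_trainable=12):
    """Rough static GPU memory (GB) for LoRA fine-tuning.

    Ignores activations, gradients of frozen weights, CUDA context,
    and framework overhead, so the real peak is higher.
    """
    weights = base_params * bytes_per_weight                  # frozen fp16 base weights
    adapters = lora_params * bytes_per_weight                 # fp16 LoRA adapter weights
    optimizer = lora_params * optimizer_bytes_per_trainable   # fp32 master copy + Adam moments
    return (weights + adapters + optimizer) / 1e9

print(round(estimate_lora_memory_gb(), 1))  # ≈ 12.5 GB before activations
```

Even under these assumptions the frozen fp16 weights alone are ~12.4 GB per GPU, so activation memory, the CUDA context of each DDP process, and any other process sharing the card can plausibly push a 24 GB GPU over the edge.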

pending

Why is it that when I increase the batch size from 1 to 2, the number of steps changes but the total training time stays the same?
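The step count measures optimizer updates, and each update consumes `batch_size` samples, so doubling the batch size halves the number of steps while each step does roughly twice the work. Total time therefore stays about the same unless the larger batch uses the GPU more efficiently. A minimal sketch of the arithmetic (the sample count is hypothetical):

```python
import math

def training_steps(num_samples, batch_size, grad_accum=1, num_gpus=1, epochs=1):
    """Optimizer steps per run: each step consumes batch_size * grad_accum * num_gpus samples."""
    effective_batch = batch_size * grad_accum * num_gpus
    return math.ceil(num_samples / effective_batch) * epochs

# Doubling the batch size halves the step count for the same dataset.
print(training_steps(10000, batch_size=1))  # 10000 steps
print(training_steps(10000, batch_size=2))  # 5000 steps
```

In both cases 10,000 samples are processed per epoch; only the number of updates differs.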

solved

Can you show your `pip list`? I have problems with my environment, where mmcv, mmdet, and mmpose are not compatible.
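One way to collect just the relevant versions without pasting a full `pip list` is to query installed distributions from Python; a small sketch using the standard library (the package names checked are the ones mentioned above):

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package_name: version string, or None if not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

for pkg, ver in installed_versions(["mmcv", "mmcv-full", "mmdet", "mmpose"]).items():
    print(pkg, ver or "not installed")
```

Comparing these against the compatibility table in each project's documentation is usually enough to spot the mismatched pair.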