MiniCPM
[Feature Request]: Can you provide a detailed requirements.txt
Feature request
Your great work helps me a lot!
I met some bugs when finetuning openbmb/MiniCPM-2B-sft-bf16. I guess they are caused by version inconsistencies among some packages (torch, accelerate, etc.). I have checked the requirements here; could you provide a detailed requirements.txt?
Thanks.
Could you paste your bug report? Thanks!
It does not crash directly, but it creates multiple processes on cuda:0.
Would you mind pasting your script? It seems CUDA_VISIBLE_DEVICES is not being used correctly for isolation.
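For instance, a quick per-rank check at the top of the training script would confirm whether each rank binds to its own GPU. This is only a minimal sketch, assuming a deepspeed-style launch that sets LOCAL_RANK for every process; if every rank reports current_device=0, the placement is wrong.

import os
import torch

# LOCAL_RANK is set by the deepspeed launcher, one process per GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
# Bind this process to its own device before any CUDA allocation.
torch.cuda.set_device(local_rank)
print(
    f"rank={local_rank} "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
    f"visible_gpus={torch.cuda.device_count()} "
    f"current_device={torch.cuda.current_device()}"
)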
It is our internal toolkit and has been adapted to many transformer-based models. The script:
deepspeed --num_gpus 8 benchmark.py \
-it \
-t_data $TRAINDATA \
-te \
-v_data $EVALDATA \
--model_path $BASEMODEL \
--model_name $2 \
--gen_config $3 \
--bf16 \
-output_dir $OUTDIR \
-m_bsz $4 \
-e_bsz $4 \
-max_len 1024 \
--max_steps 3072 \
--save_steps 1024 \
--template_name none \
-lr 2e-5 \
-bsz 64 \
--gradient_checkpointing \
--train_files_pattern '/*/train/*.jsonl' \
--val_files_pattern '/*/eval/*.jsonl' \
-output \
--deepspeed true
I have encountered similar problems before; it is usually a memory-management bug in some library such as torch/deepspeed/peft or flash_attn on a particular CUDA version. So I guess it must be a version mismatch in our environments.
It appears that benchmark.py is not included in our repository. Could you please provide more details? My suspicion is that device_map might be the root cause.
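If the toolkit loads the model with device_map="auto", every one of the 8 deepspeed ranks will try to shard the model across all visible GPUs, which typically shows up as extra processes allocating memory on cuda:0. A minimal sketch of the difference follows; the loading code is an illustrative assumption, not your actual toolkit code.

import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Problematic under a multi-process deepspeed launch: each rank spreads the model
# over all visible GPUs, so every rank ends up with allocations on cuda:0.
# model = AutoModelForCausalLM.from_pretrained(
#     "openbmb/MiniCPM-2B-sft-bf16", torch_dtype=torch.bfloat16,
#     trust_remote_code=True, device_map="auto")

# Safer with deepspeed: load without device_map and place the model on the rank's
# own device (or leave placement entirely to the deepspeed engine).
model = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-sft-bf16",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.to(f"cuda:{local_rank}")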
It is our internal toolkit. In short, could you please provide your versions of CUDA, torch, deepspeed, flash_attn, xformers, and other key packages?
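For reference, this is roughly how we dump the versions on our side. It is only a minimal sketch: the pip distribution names in the list are assumptions, and flash_attn or xformers may simply not be installed.

import importlib.metadata as md
import torch

print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
# Look up each package by its pip distribution name; skip ones that are missing.
for pkg in ("deepspeed", "flash-attn", "xformers", "transformers", "accelerate", "peft"):
    try:
        print(f"{pkg}:", md.version(pkg))
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")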