MiniCPM
[Feature Request]: Can you provide a detailed requirements.txt
Feature request
Your great work helps me a lot!
I met some bugs when finetuning openbmb/MiniCPM-2B-sft-bf16. I guess they are caused by version inconsistencies among some packages (torch, accelerate, etc.). I have checked the requirements here; could you provide a detailed requirements.txt?
Thanks.
Could you paste your bug report? Thanks!
It does not crash directly, but it creates multiple processes on cuda:0.
Would you mind pasting your script? It seems CUDA_VISIBLE_DEVICES is not being used correctly for isolation.
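For instance, a quick per-rank check at the top of the training script would confirm whether each rank binds to its own GPU. This is only a minimal sketch, assuming a deepspeed-style launch that sets LOCAL_RANK for every process; if every rank reports current_device=0, the placement is wrong.

import os
import torch

# LOCAL_RANK is set by the deepspeed launcher, one process per GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
# Bind this process to its own device before any CUDA allocation.
torch.cuda.set_device(local_rank)
print(
    f"rank={local_rank} "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
    f"visible_gpus={torch.cuda.device_count()} "
    f"current_device={torch.cuda.current_device()}"
)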
It is our internal toolkit and has been adapted to many transformer-based models. The script:
deepspeed --num_gpus 8 benchmark.py \
-it \
-t_data $TRAINDATA \
-te \
-v_data $EVALDATA \
--model_path $BASEMODEL \
--model_name $2 \
--gen_config $3 \
--bf16 \
-output_dir $OUTDIR \
-m_bsz $4 \
-e_bsz $4 \
-max_len 1024 \
--max_steps 3072 \
--save_steps 1024 \
--template_name none \
-lr 2e-5 \
-bsz 64 \
--gradient_checkpointing \
--train_files_pattern '/*/train/*.jsonl' \
--val_files_pattern '/*/eval/*.jsonl' \
-output \
--deepspeed true
I have encountered similar problems before; it is usually a memory-management bug in some library such as torch/deepspeed/peft or flash_attn on a particular CUDA version. So I guess it must be a version mismatch in our environments.
It appears that benchmark.py is not included in our repository. Could you please provide more details? My suspicion is that device_map might be the root cause.
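If the toolkit loads the model with device_map="auto", every one of the 8 deepspeed ranks will try to shard the model across all visible GPUs, which typically shows up as extra processes allocating memory on cuda:0. A minimal sketch of the difference follows; the loading code is an illustrative assumption, not your actual toolkit code.

import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Problematic under a multi-process deepspeed launch: each rank spreads the model
# over all visible GPUs, so every rank ends up with allocations on cuda:0.
# model = AutoModelForCausalLM.from_pretrained(
#     "openbmb/MiniCPM-2B-sft-bf16", torch_dtype=torch.bfloat16,
#     trust_remote_code=True, device_map="auto")

# Safer with deepspeed: load without device_map and place the model on the rank's
# own device (or leave placement entirely to the deepspeed engine).
model = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-sft-bf16",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.to(f"cuda:{local_rank}")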
It is our internal toolkit. In short, could you please provide your versions of CUDA, torch, deepspeed, flash_attn, xformers, and other key packages?
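For reference, this is roughly how we dump the versions on our side. It is only a minimal sketch: the pip distribution names in the list are assumptions, and flash_attn or xformers may simply not be installed.

import importlib.metadata as md
import torch

print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
# Look up each package by its pip distribution name; skip ones that are missing.
for pkg in ("deepspeed", "flash-attn", "xformers", "transformers", "accelerate", "peft"):
    try:
        print(f"{pkg}:", md.version(pkg))
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")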