[BUG] When using Zero-Infinity, Assertion `n_completes >= min_completes' failed
Describe the bug
I can fine-tune my model with ZeRO stage 2 and stage 3 using my script. However, when I use ZeRO-Infinity to offload parameters, the following error occurs:
python: /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp:125: int _do_io_complete(long long int, long long int, std::unique_ptr<aio_context>&, std::vector<std::chrono::duration
I searched Google but found nothing relevant. This is beyond my knowledge and I have no idea how to fix it.
Thank you!
To Reproduce
Steps to reproduce the behavior:
- Enable ZeRO-Infinity parameter offloading in the DeepSpeed config.
- Launch fine-tuning with the command shown under "Launcher context" below.
- The assertion failure above occurs.
Expected behavior
Training runs to completion with ZeRO-Infinity parameter offloading, just as it does with ZeRO stage 2 and stage 3.
ds_report output
ds_report
[2023-12-31 02:58:42,953] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.7+40342055, 40342055, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 377.29 GB
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 1x A100 40GB
- Python version: 3.10
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
The deepspeed launcher:
CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 /code/we_media/LLaMA-Factory/src/train_bash.py \
    --model_name_or_path /code/model/writer_1.3b_01_hf \
    --dataset_dir /code/dataset/ \
    --output_dir /code/output/writer_1.3b \
    --flash_attn \
    --dataset dpo_data \
    --stage dpo \
    --do_train True \
    --finetuning_type lora \
    --template llama2_zh \
    --cutoff_len 16384 \
    --learning_rate 1e-4 \
    --preprocessing_num_workers 8 \
    --num_train_epochs 1.0 \
    --max_samples 1000000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --save_steps 100 \
    --warmup_steps 0 \
    --neftune_noise_alpha 5 \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout 0 \
    --lora_target all \
    --bf16 True \
    --plot_loss True \
    --overwrite_output_dir True \
    --deepspeed ds_config_zero3.json
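My exact ds_config_zero3.json is not reproduced here; the ZeRO-Infinity part of it follows the general shape below. This is a minimal sketch, and the nvme_path, buffer sizes, and aio settings are illustrative placeholders rather than the real values from my setup:

{
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8
    }
  },
  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  }
}

Setting "offload_param" to "device": "nvme" is what routes parameter reads/writes through the async I/O (aio) extension, i.e. the deepspeed_aio_common.cpp code path where the assertion is raised.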