
[BUG] When using Zero-Infinity, Assertion `n_completes >= min_completes' failed

Open IvoryTower800 opened this issue 1 year ago • 5 comments

Describe the bug
I can use my script to fine-tune the model with ZeRO stage 2 and stage 3. However, when I use ZeRO-Infinity to offload parameters, the following error occurs:

python: /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp:125: int _do_io_complete(long long int, long long int, std::unique_ptr<aio_context>&, std::vector<std::chrono::duration >&): Assertion `n_completes >= min_completes' failed.
[2023-12-31 02:55:09,375] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 302

I searched Google but found nothing. This is beyond my knowledge, and I have no idea how to fix it.
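For context, the ZeRO-Infinity setup I mean is parameter offloading to NVMe, configured through the zero_optimization.offload_param and aio sections of ds_config_zero3.json. The sketch below shows the general shape of that configuration; the nvme_path and buffer sizes are illustrative placeholders, not my exact values:

import json

# Sketch of the ZeRO-Infinity parameter-offload part of ds_config_zero3.json.
# nvme_path and the buffer sizes are placeholders, not the exact values I use.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",            # ZeRO-Infinity: parameters are offloaded to NVMe
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 1e8,
        },
    },
    "aio": {  # async-I/O engine settings used by the NVMe offload path
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
}

print(json.dumps(ds_config, indent=2))

The same run works with plain ZeRO stage 2 and 3; the assertion only appears once this NVMe/aio offload path is enabled.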

Thank you!

To Reproduce
Run the launcher command shown under "Launcher context" below, with ZeRO-Infinity parameter offloading enabled in ds_config_zero3.json; the assertion above is then raised during training.

Expected behavior
Training runs to completion with ZeRO-Infinity parameter offloading, just as it does with ZeRO stage 2 and 3.

ds_report output

ds_report

[2023-12-31 02:58:42,953] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.7+40342055, 40342055, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 377.29 GB
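Since the failing assertion is in deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp and async_io above shows [NO] installed / [OKAY] compatible, the op can also be checked directly with DeepSpeed's op builder. A minimal sketch, assuming the op_builder API in this DeepSpeed version:

from deepspeed.ops.op_builder import AsyncIOBuilder

# async_io is reported [NO] installed / [OKAY] compatible above, so it is JIT-compiled on demand.
# is_compatible() re-runs the libaio check that ds_report performs for this op;
# AsyncIOBuilder().load() would trigger the actual JIT build (it needs ninja and libaio).
print("async_io compatible:", AsyncIOBuilder().is_compatible())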

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: one A100 40GB
  • Python version: 3.10

Launcher context
I am launching the experiment with the deepspeed launcher:

CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 /code/we_media/LLaMA-Factory/src/train_bash.py \
    --model_name_or_path /code/model/writer_1.3b_01_hf \
    --dataset_dir /code/dataset/ \
    --output_dir /code/output/writer_1.3b \
    --flash_attn \
    --dataset dpo_data \
    --stage dpo \
    --do_train True \
    --finetuning_type lora \
    --template llama2_zh \
    --cutoff_len 16384 \
    --learning_rate 1e-4 \
    --preprocessing_num_workers 8 \
    --num_train_epochs 1.0 \
    --max_samples 1000000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --save_steps 100 \
    --warmup_steps 0 \
    --neftune_noise_alpha 5 \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout 0 \
    --lora_target all \
    --bf16 True \
    --plot_loss True \
    --overwrite_output_dir True \
    --deepspeed ds_config_zero3.json

IvoryTower800 • Dec 30 '23 19:12