
deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.

JIN-096 opened this issue Mar 06 '22 · 0 comments

I keep hitting this problem with Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh, and I'm not sure what is causing it. The error is below:

python: /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:159: void deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
Killing subprocess 484
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '../pretrain_gpt2.py', '--local_rank=0', '--model-parallel-size', '1', '--num-layers', '5', '--hidden-size', '1024', '--num-attention-heads', '16', '--seq-length', '1024', '--max-position-embeddings', '1024', '--batch-size', '4', '--train-iters', '1', '--lr-decay-iters', '1', '--load', 'checkpoints/gpt2_345m_ds', '--data-path', '/home/jjin/wiki-gpt_text_document', '--vocab-file', '/home/jjin/gpt2-vocab.json', '--merge-file', '/home/jjin/gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '1.5e-4', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--eval-interval', '2000', '--eval-iters', '10', '--fp16', '--scattered-embeddings', '--split-transformers', '--deepspeed', '--deepspeed_config', '/workspace/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_infinity_config.json', '--zero-stage', '3', '--zero-reduce-bucket-size', '50000000', '--remote-device', 'nvme', '--zero-allgather-bucket-size', '5000000000', '--zero-contigious-gradients', '--zero-reduce-scatter', '--deepspeed-activation-checkpointing', '--checkpoint-num-layers', '1', '--partition-activations', '--checkpoint-in-cpu', '--synchronize-each-layer', '--contigious-checkpointing']' died with <Signals.SIGABRT: 6>.

I am using the example configuration from the repo (ds_zero_stage_infinity_config.json), with only one change: max_in_cpu 1 -> 1e9.
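
For context, here is a minimal sketch of what that part of the config looks like after my change, assuming the stock example layout; the nvme_path, pin_memory, and aio tuning values below are illustrative placeholders rather than my exact settings:

{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "max_in_cpu": 1e9,
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  },
  "aio": {
    "block_size": 1048576,
    "queue_depth": 8,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  }
}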

Does anyone know why I am getting this issue?

Thank you!

JIN-096 · Mar 06 '22 08:03