DeepSpeedExamples
deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
I keep running into this problem with Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh and I'm not sure what is causing it. The error is below:
python: /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:159: void deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
Killing subprocess 484
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '../pretrain_gpt2.py', '--local_rank=0', '--model-parallel-size', '1', '--num-layers', '5', '--hidden-size', '1024', '--num-attention-heads', '16', '--seq-length', '1024', '--max-position-embeddings', '1024', '--batch-size', '4', '--train-iters', '1', '--lr-decay-iters', '1', '--load', 'checkpoints/gpt2_345m_ds', '--data-path', '/home/jjin/wiki-gpt_text_document', '--vocab-file', '/home/jjin/gpt2-vocab.json', '--merge-file', '/home/jjin/gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '1.5e-4', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--eval-interval', '2000', '--eval-iters', '10', '--fp16', '--scattered-embeddings', '--split-transformers', '--deepspeed', '--deepspeed_config', '/workspace/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_infinity_config.json', '--zero-stage', '3', '--zero-reduce-bucket-size', '50000000', '--remote-device', 'nvme', '--zero-allgather-bucket-size', '5000000000', '--zero-contigious-gradients', '--zero-reduce-scatter', '--deepspeed-activation-checkpointing', '--checkpoint-num-layers', '1', '--partition-activations', '--checkpoint-in-cpu', '--synchronize-each-layer', '--contigious-checkpointing']' died with <Signals.SIGABRT: 6>.
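For context, the assertion `0 == _num_pending_ops` in `_stop_threads()` fires when the AIO handle's worker threads are shut down while asynchronous I/O operations are still outstanding. Below is a minimal Python analogue of that invariant (the class and method names here are illustrative, not DeepSpeed's actual API); it only sketches the pending-op accounting the C++ assertion enforces:

```python
import threading

class AioHandle:
    """Toy analogue of deepspeed_aio_handle_t's pending-op accounting."""

    def __init__(self):
        self._lock = threading.Lock()
        self._num_pending_ops = 0

    def submit_read(self):
        # Submitting an async op increments the pending-op counter.
        with self._lock:
            self._num_pending_ops += 1

    def wait(self):
        # Waiting drains completed ops, bringing the counter back to zero.
        with self._lock:
            self._num_pending_ops = 0

    def stop_threads(self):
        # Mirrors _stop_threads(): shutdown asserts every op was waited on.
        assert 0 == self._num_pending_ops, "pending ops at shutdown"

h = AioHandle()
h.submit_read()
h.wait()          # skipping this call would trip the shutdown assertion
h.stop_threads()  # passes: no pending ops remain
```

In the real error above, something in the NVMe offload path apparently tore the handle down before all submitted ops were waited on, which aborts the process with SIGABRT.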
I am using the example configuration from here, modified only by changing max_in_cpu from 1 to 1e9.
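For reference, in the ZeRO-Infinity example configs the `max_in_cpu` knob sits under `zero_optimization.offload_param`. A sketch of the relevant fragment (field names assumed to match the example file; `nvme_path` value is illustrative):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "max_in_cpu": 1e9
    }
  }
}
```

`max_in_cpu` caps how many parameter elements may be cached in CPU memory when parameters are offloaded to NVMe, so raising it from 1 to 1e9 changes how much traffic goes through the async I/O path.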
Does anyone know why I am getting this issue?
Thank you!