DeepSpeedExamples
RecursionError: maximum recursion depth exceeded while calling a Python object
I was trying to run Megatron with a ZeRO 2 config when I encountered this error:
finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 1894.21 | train/valid/test data iterators: 357.88
training ...
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 156, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/root/megatron-3d/megatron/training.py", line 97, in pretrain
    iteration = train(forward_step_func,
  File "/root/megatron-3d/megatron/training.py", line 481, in train
    loss_dict, skipped_iter = train_step(forward_step_func,
  File "/root/megatron-3d/megatron/training.py", line 324, in train_step
    return train_step_pipe(model, data_iterator)
  File "/root/megatron-3d/megatron/training.py", line 358, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
    self._exec_schedule(sched)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 621, in _exec_load_micro_batch
    batch = self._next_batch()
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
    return self._next_batch()
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
    return self._next_batch()
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
    return self._next_batch()
  [Previous line repeated 978 more times]
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 469, in _next_batch
    batch = self.batch_fn(batch)
  File "pretrain_gpt2.py", line 110, in get_batch_pipe
    return fp32_to_fp16((tokens, position_ids, attention_mask)), fp32_to_fp16((labels, loss_mask))
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 53, in fp32_to_fp16
    return conversion_helper(val, half_conversion)
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in conversion_helper
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in <listcomp>
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 37, in conversion_helper
    return conversion(val)
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 48, in half_conversion
    if isinstance(val_typecheck, (Parameter, Variable)):
  File "/root/anaconda3/lib/python3.8/site-packages/torch/autograd/variable.py", line 7, in __instancecheck__
    return isinstance(other, torch.Tensor)
RecursionError: maximum recursion depth exceeded while calling a Python object
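The "[Previous line repeated 978 more times]" lines up with CPython's default recursion limit of 1000: _next_batch keeps calling itself, so by the time the fp16 conversion runs there are almost no stack frames left and the next few calls push it over the limit. A minimal sketch of that failure mode, not DeepSpeed's actual code, with a hypothetical iterative variant for contrast:

import sys

print(sys.getrecursionlimit())  # 1000 by default in CPython

def next_batch_recursive(iterator):
    # Retrying by self-call adds a stack frame per attempt, so a long run of
    # retries (here, an exhausted iterator) eventually raises RecursionError.
    batch = next(iterator, None)
    if batch is None:
        return next_batch_recursive(iterator)
    return batch

def next_batch_iterative(iterator):
    # The same retry logic as a loop keeps the stack flat no matter how many
    # attempts are made (a real implementation would also bound the retries).
    batch = next(iterator, None)
    while batch is None:
        batch = next(iterator, None)
    return batch

try:
    next_batch_recursive(iter([]))  # exhausted iterator -> endless self-calls
except RecursionError as err:
    print("RecursionError:", err)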
This doesn't occur with the following config:
{
  "train_batch_size": 224,
  "train_micro_batch_size_per_gpu": 4,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0,
      "betas": [0.9, 0.95]
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false
}
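Per the description above, the failing run differed in enabling ZeRO stage 2. A hedged sketch of the kind of zero_optimization block that distinguishes such a run from the working config, written here as a Python dict for illustration (the field names follow the DeepSpeed config schema; the values are illustrative, not the exact config that crashed):

# Illustrative ZeRO stage 2 section; in practice this lives alongside the
# other keys in the JSON config shown above.
zero_2_section = {
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}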
Ditto @ShadenSmith
Hi there, thanks for sharing and for the ping. Can you share the config that also reproduces this, and which DeepSpeed version you are using?
This issue was fixed for v0.3.11, which is available from source but not from PyPI yet. If you try out the code on our master
branch, I'd be very curious if the crash goes away, or if you see a crash from a different location.
Hi, I have the same issue with DeepSpeed version 0.8.0. I'm calling my Python script with NCCL_DEBUG=INFO NCCL_BLOCKING_WAIT=1 deepspeed --num_gpus=4 --master_addr="myIP" --master_port=1234 --hostfile=job/hostfile myPythonScript.py.
I'm using the Hugging Face Trainer implementation and the ds_config file from here: https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#zero3-config
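For context, a minimal sketch of how such a setup hands the config to the Trainer; the model, dataset, and file name below are placeholders, but the deepspeed argument of TrainingArguments is the hook through which trainer.train() ends up in deepspeed_init in the trace that follows:

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    # Tiny placeholder dataset so the sketch is self-contained.
    def __len__(self):
        return 8

    def __getitem__(self, i):
        ids = torch.tensor([i, i + 1, i + 2])
        return {"input_ids": ids, "labels": ids}

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed="ds_config_zero3.json",  # the ZeRO-3 config linked above
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset())
trainer.train()  # internally calls deepspeed.initialize(...), as in the trace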
The stack trace of the error is:
File "myPythonScript.py", line 230, in train
trainer.train()
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1527, in train
return inner_training_loop(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1597, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
return inner_training_loop(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1597, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
engine = DeepSpeedEngine(args=args,
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 348, in wrapper
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 348, in wrapper
if not hasattr(module, "_ds_child_entered"):
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
if not hasattr(module, "_ds_child_entered"):
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
if name in dir(self):
File "/home/ballin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2022, in __dir__
parameters = list(self._parameters.keys())
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
if name in dir(self):
....
... multiple hundred lines of the same two function calls ....
....
File "/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2028, in __dir__
parameters = list(self._parameters.keys())
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
if name in dir(self):
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2028, in __dir__
parameters = list(self._parameters.keys())
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
if name in dir(self):
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
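The trace shows a mutual recursion: DeepSpeedEngine.__getattr__ (engine.py line 495) calls dir(self), nn.Module.__dir__ then reads attributes such as self._parameters that are not set yet, which routes back into __getattr__, and the two keep calling each other until the recursion limit is hit. A standalone sketch of that pattern with illustrative names, not DeepSpeed's actual code:

class Wrapper:
    def __init__(self, wrapped):
        self.module = wrapped  # set normally, so lookups of "module" succeed

    def __getattr__(self, name):
        # Only called when normal lookup fails. Consulting dir(self) here
        # re-enters __dir__ ...
        if name in dir(self):
            return getattr(self.module, name)
        raise AttributeError(name)

    def __dir__(self):
        # ... and this touches an attribute that was never set, which goes
        # back through __getattr__, so the two methods recurse together.
        return list(self._parameters.keys()) + list(super().__dir__())

w = Wrapper(object())
try:
    w.missing_attribute  # __getattr__ -> __dir__ -> __getattr__ -> ...
except RecursionError as err:
    print("RecursionError:", err)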
This is the output of ds_report:
MLFlow does not exist. Disabling MLFlow logging
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/miniconda3/envs/venv/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.0+unknown, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7