
RecursionError: maximum recursion depth exceeded while calling a Python object

Open ShivanshuPurohit opened this issue 4 years ago • 4 comments

I was trying to run Megatron with a ZeRO 2 config when I encountered this error:

finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 1894.21 | train/valid/test data iterators: 357.88
training ...
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 156, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/root/megatron-3d/megatron/training.py", line 97, in pretrain
    iteration = train(forward_step_func,
  File "/root/megatron-3d/megatron/training.py", line 481, in train
    loss_dict, skipped_iter = train_step(forward_step_func,
  File "/root/megatron-3d/megatron/training.py", line 324, in train_step
    return train_step_pipe(model, data_iterator)
  File "/root/megatron-3d/megatron/training.py", line 358, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
    self._exec_schedule(sched)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 621, in _exec_load_micro_batch
    batch = self._next_batch()
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
    return self._next_batch()
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
    return self._next_batch()
  [Previous line repeated 978 more times]
  File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 469, in _next_batch
    batch = self.batch_fn(batch)
  File "pretrain_gpt2.py", line 110, in get_batch_pipe
    return fp32_to_fp16((tokens, position_ids, attention_mask)), fp32_to_fp16((labels, loss_mask))
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 53, in fp32_to_fp16
    return conversion_helper(val, half_conversion)
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in conversion_helper
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in <listcomp>
    rtn = [conversion_helper(v, conversion) for v in val]
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 37, in conversion_helper
    return conversion(val)
  File "/root/megatron-3d/megatron/fp16/fp16.py", line 48, in half_conversion
    if isinstance(val_typecheck, (Parameter, Variable)):
  File "/root/anaconda3/lib/python3.8/site-packages/torch/autograd/variable.py", line 7, in __instancecheck__
    return isinstance(other, torch.Tensor)
RecursionError: maximum recursion depth exceeded while calling a Python object
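
The repeated engine.py:480 frames show _next_batch retrying by calling itself, so every retry costs a stack frame. Below is a minimal sketch of that pattern, purely illustrative and not DeepSpeed's actual engine code, together with an iterative variant that keeps the stack flat:

class RecursiveLoader:
    def __init__(self, skips):
        self.skips = skips

    def _fetch(self):
        # Stand-in for pulling from the data iterator: returns None `skips`
        # times before a usable batch shows up.
        if self.skips > 0:
            self.skips -= 1
            return None
        return "batch"

    def _next_batch(self):
        batch = self._fetch()
        if batch is None:              # nothing usable yet: retry ...
            return self._next_batch()  # ... by recursing; one stack frame per retry
        return batch

loader = RecursiveLoader(skips=5000)
try:
    loader._next_batch()
except RecursionError:
    print("hit CPython's default ~1000-frame limit, as in the trace above")

def next_batch_iterative(loader):
    # Same retry logic expressed as a loop: constant stack depth regardless of
    # how many retries it takes.
    while True:
        batch = loader._fetch()
        if batch is not None:
            return batch

print(next_batch_iterative(RecursiveLoader(skips=5000)))  # prints: batch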

This doesn't occur with the following config:

{
  "train_batch_size": 224,
  "train_micro_batch_size_per_gpu": 4,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0,
      "betas": [0.9, 0.95]
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,

    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false
}
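
For what it's worth, one generic Python check (not a DeepSpeed feature, and not a fix for the underlying loop) is to raise the interpreter's recursion limit before training; if the RecursionError reappears with an even longer run of the same frames, the call chain is effectively unbounded rather than just deep:

import sys

print(sys.getrecursionlimit())   # CPython's default is usually 1000
sys.setrecursionlimit(10_000)    # temporary headroom for debugging only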

ShivanshuPurohit avatar Feb 09 '21 09:02 ShivanshuPurohit

Ditto @ShadenSmith

FannierPeng avatar Feb 18 '21 11:02 FannierPeng

Hi there, thanks for sharing and for the ping. Can you share the config that also reproduces the error, and which DeepSpeed version you are using?

This issue was fixed for v0.3.11, which is available from source but not from PyPI yet. If you try out the code on our master branch, I'd be very curious if the crash goes away, or if you see a crash from a different location.

ShadenSmith avatar Feb 18 '21 14:02 ShadenSmith

Hi, I have the same issue with DeepSpeed version 0.8.0. I'm calling my Python script with NCCL_DEBUG=INFO NCCL_BLOCKING_WAIT=1 deepspeed --num_gpus=4 --master_addr="myIP" --master_port=1234 --hostfile=job/hostfile myPythonScript.py. I'm using the Hugging Face Trainer implementation and the ds_config file from here: https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#zero3-config
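
The setup is roughly the following sketch (model name, toy dataset, and config path are placeholders rather than the actual script; the real thing is launched with the deepspeed command above):

from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Tiny stand-in for the real training data."""
    def __init__(self, tokenizer):
        ids = tokenizer("hello world", return_tensors="pt")["input_ids"][0]
        self.sample = {"input_ids": ids, "labels": ids.clone()}
    def __len__(self):
        return 16
    def __getitem__(self, idx):
        return self.sample

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model/tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed="ds_config_zero3.json",  # the ZeRO-3 config from the HF docs page linked above
)

trainer = Trainer(model=model, args=training_args, train_dataset=ToyDataset(tokenizer))
trainer.train()  # the trainer.train() call that appears at the top of the trace below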

The stack trace of the error is:

File "myPythonScript.py", line 230, in train
  trainer.train()
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1527, in train
  return inner_training_loop(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1597, in _inner_training_loop
  deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
  deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    return inner_training_loop(
  File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1597, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/miniconda3/envs/venv/lib/python3.10/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
  engine = DeepSpeedEngine(args=args,
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 348, in wrapper
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 348, in wrapper
  if not hasattr(module, "_ds_child_entered"):
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
    if not hasattr(module, "_ds_child_entered"):
  File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
  if name in dir(self):
File "/home/ballin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2022, in __dir__
  parameters = list(self._parameters.keys())
File "/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
  if name in dir(self):
....
... multiple hundred lines of the same two function calls ....
....
File "/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2028, in __dir__
  parameters = list(self._parameters.keys())
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
  if name in dir(self):
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2028, in __dir__
  parameters = list(self._parameters.keys())
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 495, in __getattr__
  if name in dir(self):
File "/mnt/ssestorage2-data/ballin/miniconda3/envs/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
  module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
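
The two frames that alternate here are DeepSpeedEngine.__getattr__ (engine.py:495), which calls dir(self), and torch.nn.Module.__dir__, which reads self._parameters; when that attribute is not populated yet, the lookup falls back into __getattr__ and the two keep calling each other. A stripped-down illustration of the pattern, not DeepSpeed's actual code:

class Engineish:
    """Minimal repro of the __getattr__ / __dir__ loop seen in the trace above."""

    def __getattr__(self, name):
        if name in dir(self):                      # dir() invokes __dir__ ...
            return object.__getattribute__(self, name)
        raise AttributeError(name)

    def __dir__(self):
        # ... which touches self._parameters; since _parameters was never
        # assigned, the lookup re-enters __getattr__ and the cycle repeats.
        return list(self._parameters.keys())

try:
    Engineish().some_attribute
except RecursionError as err:
    print("reproduced:", err)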

This is the output of ds_report:

MLFlow does not exist. Disabling MLFlow logging


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/miniconda3/envs/venv/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/miniconda3/envs/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.0+unknown, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
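
The deepspeed info line reading 0.8.0+unknown most likely just means the wheel was built without git metadata; the installed version can be double-checked at runtime:

import deepspeed
print(deepspeed.__version__)   # expected to print 0.8.0 for this environment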

bettyballin avatar Jan 26 '23 16:01 bettyballin

Hello, your email has been received. Thank you!

FannierPeng avatar Jan 26 '23 16:01 FannierPeng