Describe the bug
When running the GPT-2 autotuning example at https://github.com/microsoft/DeepSpeedExamples/tree/master/autotuning/hf/gpt2, training fails after a few steps with AttributeError: 'NewGELUActivation' object has no attribute 'flops'.
To Reproduce
./test_tune.sh tune
Expected behavior
Runs successfully
ds_report output
[2022-06-23 22:20:46,880] [WARNING] [partition_parameters.py:60:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch']
torch version .................... 1.8.1+cu111
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.6.6+6719b46b, 6719b46b, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 1 x GTX 1080
- Python version: 3.9.12
Additional context
Full log from the autotuning run, including the traceback:
[INFO|modeling_utils.py:1997] 2022-06-23 22:02:30,770 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /home/xiaze/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
[INFO|modeling_utils.py:2384] 2022-06-23 22:02:32,118 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2392] 2022-06-23 22:02:32,118 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
WARNING:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c0863a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1c80317fa3b1799d.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-bdd640fb06671ad1.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c0863a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-3eb13b9046685257.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086ee0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-23b8c1e9392456de.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1a3d1fa7bc8960a9.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-bd9c66b3ad3c2d6d.arrow
[INFO|trainer.py:478] 2022-06-23 22:02:33,775 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:533] 2022-06-23 22:02:33,775 >> Using cuda_amp half precision backend
/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True
to disable this warning
warnings.warn(
[INFO|trainer.py:1517] 2022-06-23 22:02:35,455 >> ***** Running training *****
[INFO|trainer.py:1518] 2022-06-23 22:02:35,455 >> Num examples = 2318
[INFO|trainer.py:1519] 2022-06-23 22:02:35,455 >> Num Epochs = 1
[INFO|trainer.py:1520] 2022-06-23 22:02:35,455 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1521] 2022-06-23 22:02:35,456 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1522] 2022-06-23 22:02:35,456 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1523] 2022-06-23 22:02:35,456 >> Total optimization steps = 200
0%| | 0/200 [00:00<?, ?it/s]
0%| | 1/200 [00:00<01:12, 2.76it/s]
1%| | 2/200 [00:00<01:18, 2.53it/s]
2%|▏ | 3/200 [00:01<01:09, 2.84it/s]
2%|▏ | 4/200 [00:01<01:14, 2.62it/s]Traceback (most recent call last):
File "/home/xiaze/ds/transformers/examples/pytorch/language-modeling/run_clm.py", line 579, in
main()
File "/home/xiaze/ds/transformers/examples/pytorch/language-modeling/run_clm.py", line 527, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 1410, in train
return inner_training_loop(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 1652, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 2346, in training_step
loss = self.compute_loss(model, inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 2378, in compute_loss
outputs = model(**inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1616, in forward
loss = self.module(*inputs, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1056, in forward
transformer_outputs = self.transformer(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
outputs = block(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 436, in forward
feed_forward_hidden_states = self.mlp(hidden_states)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 364, in forward
hidden_states = self.act(hidden_states)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 893, in _call_impl
hook_result = hook(self, input, result)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/profiling/flops_profiler/profiler.py", line 90, in post_hook
module.flops += sum([elem[1] for elem in module_flop_count[-1]])
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 947, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'NewGELUActivation' object has no attribute 'flops'
2%|▏ | 4/200 [00:01<01:16, 2.56it/s]
I ran into the same issue. The root cause is that huggingface instantiates activation-function modules such as NewGELUActivation once at Python module (global) scope, so every transformer block ends up holding the same activation object. When deepspeed recursively registers its profiler hooks on the model, it therefore registers several forward hooks on that one shared NewGELUActivation instance, and each registration overwrites the removable-handle attributes previously saved on the object. When deepspeed later tries to remove the hooks, it can only remove one of them, leaving stale hooks on the module; once the profiler has deleted the flops attribute, a stale hook fires on the next forward pass and raises the AttributeError shown above.
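Here is a minimal standalone sketch of the mechanism in plain PyTorch (this is not the deepspeed profiler code; instrument, uninstrument and _profile_handle are made-up names). Saving a single handle attribute on a module that is shared across blocks means tear-down can only undo the last registration:

```python
import torch
import torch.nn as nn

# One activation object shared by every block, like transformers' module-level activation instances.
shared_act = nn.GELU()

def post_hook(module, inputs, output):
    # Mirrors the profiler's post-forward bookkeeping: a read-modify-write of module.flops.
    module.flops += 1

def instrument(module):
    module.flops = 0
    # The handle is saved as a single attribute, so a second registration overwrites the first handle.
    module._profile_handle = module.register_forward_hook(post_hook)

def uninstrument(module):
    module._profile_handle.remove()  # removes only the most recently registered hook
    del module.flops

# The recursive model walk reaches the shared activation twice, so it gets instrumented twice.
instrument(shared_act)
instrument(shared_act)

# Tear-down removes one hook and deletes the counter...
uninstrument(shared_act)

# ...but the first hook is still registered, so the next forward pass fails with
# AttributeError: 'GELU' object has no attribute 'flops'
shared_act(torch.randn(4))
```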
I can push a fix for this in a bit.
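For reference only (this is not the actual patch, and instrument/uninstrument/_profile_handles are made-up names), one robust direction is to accumulate every registered handle in a list on the module instead of a single overwritable attribute, so tear-down can remove all hooks even when the same module object is visited more than once:

```python
import torch.nn as nn

def post_hook(module, inputs, output):
    module.flops += 1

def instrument(module: nn.Module):
    # Initialize the counter only once, even if the shared module is visited repeatedly.
    if not hasattr(module, "flops"):
        module.flops = 0
    handles = getattr(module, "_profile_handles", [])
    handles.append(module.register_forward_hook(post_hook))
    module._profile_handles = handles

def uninstrument(module: nn.Module):
    # Remove every hook that was registered, not just the last one.
    for handle in getattr(module, "_profile_handles", []):
        handle.remove()
    module._profile_handles = []
    if hasattr(module, "flops"):
        del module.flops
```

An alternative would be to skip registration entirely when the module already carries a profiler handle, which also avoids counting the shared module's flops more than once per forward call.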