Describe the bug
When running the GPT-2 autotuning example at https://github.com/microsoft/DeepSpeedExamples/tree/master/autotuning/hf/gpt2, training fails after a few steps with AttributeError: 'NewGELUActivation' object has no attribute 'flops'.
To Reproduce
./test_tune.sh tune
Expected behavior
Runs successfully
ds_report output
[2022-06-23 22:20:46,880] [WARNING] [partition_parameters.py:60:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch']
torch version .................... 1.8.1+cu111
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.6.6+6719b46b, 6719b46b, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 1 x GTX 1080
- Python version: 3.9.12
Additional context
Full log from the autotuning run, including the traceback:
[INFO|modeling_utils.py:1997] 2022-06-23 22:02:30,770 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /home/xiaze/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
[INFO|modeling_utils.py:2384] 2022-06-23 22:02:32,118 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2392] 2022-06-23 22:02:32,118 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
WARNING:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c0863a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1c80317fa3b1799d.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-bdd640fb06671ad1.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..tokenize_function at 0x7f284c0863a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-3eb13b9046685257.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086ee0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-23b8c1e9392456de.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-1a3d1fa7bc8960a9.arrow
INFO:datasets.fingerprint:Parameter 'function'=<function main..group_texts at 0x7f284c086040> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/xiaze/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-bd9c66b3ad3c2d6d.arrow
[INFO|trainer.py:478] 2022-06-23 22:02:33,775 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:533] 2022-06-23 22:02:33,775 >> Using cuda_amp half precision backend
/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True
to disable this warning
warnings.warn(
[INFO|trainer.py:1517] 2022-06-23 22:02:35,455 >> ***** Running training *****
[INFO|trainer.py:1518] 2022-06-23 22:02:35,455 >> Num examples = 2318
[INFO|trainer.py:1519] 2022-06-23 22:02:35,455 >> Num Epochs = 1
[INFO|trainer.py:1520] 2022-06-23 22:02:35,455 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1521] 2022-06-23 22:02:35,456 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1522] 2022-06-23 22:02:35,456 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1523] 2022-06-23 22:02:35,456 >> Total optimization steps = 200
0%| | 0/200 [00:00<?, ?it/s]
0%| | 1/200 [00:00<01:12, 2.76it/s]
1%| | 2/200 [00:00<01:18, 2.53it/s]
2%|▏ | 3/200 [00:01<01:09, 2.84it/s]
2%|▏ | 4/200 [00:01<01:14, 2.62it/s]Traceback (most recent call last):
File "/home/xiaze/ds/transformers/examples/pytorch/language-modeling/run_clm.py", line 579, in
main()
File "/home/xiaze/ds/transformers/examples/pytorch/language-modeling/run_clm.py", line 527, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 1410, in train
return inner_training_loop(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 1652, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 2346, in training_step
loss = self.compute_loss(model, inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/trainer.py", line 2378, in compute_loss
outputs = model(**inputs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1616, in forward
loss = self.module(*inputs, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1056, in forward
transformer_outputs = self.transformer(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
outputs = block(
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 436, in forward
feed_forward_hidden_states = self.mlp(hidden_states)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 364, in forward
hidden_states = self.act(hidden_states)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 893, in _call_impl
hook_result = hook(self, input, result)
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/deepspeed/profiling/flops_profiler/profiler.py", line 90, in post_hook
module.flops += sum([elem[1] for elem in module_flop_count[-1]])
File "/home/xiaze/miniconda3/envs/ds_18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 947, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'NewGELUActivation' object has no attribute 'flops'
2%|▏ | 4/200 [00:01<01:16, 2.56it/s]
I ran into the same issue. The root cause is that huggingface instantiates activation-function modules such as NewGELUActivation once at Python module (global) scope, so every transformer block ends up holding the same activation object. When deepspeed recursively registers its profiler hooks on the model, it therefore registers several forward hooks on that one shared NewGELUActivation instance, and each registration overwrites the removable-handle attributes previously saved on the object. When deepspeed later tries to remove the hooks, it can only remove one of them, leaving stale hooks on the module; once the profiler has deleted the flops attribute, a stale hook fires on the next forward pass and raises the AttributeError shown above.
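Here is a minimal standalone sketch of the mechanism in plain PyTorch (this is not the deepspeed profiler code; instrument, uninstrument and _profile_handle are made-up names). Saving a single handle attribute on a module that is shared across blocks means tear-down can only undo the last registration:

```python
import torch
import torch.nn as nn

# One activation object shared by every block, like transformers' module-level activation instances.
shared_act = nn.GELU()

def post_hook(module, inputs, output):
    # Mirrors the profiler's post-forward bookkeeping: a read-modify-write of module.flops.
    module.flops += 1

def instrument(module):
    module.flops = 0
    # The handle is saved as a single attribute, so a second registration overwrites the first handle.
    module._profile_handle = module.register_forward_hook(post_hook)

def uninstrument(module):
    module._profile_handle.remove()  # removes only the most recently registered hook
    del module.flops

# The recursive model walk reaches the shared activation twice, so it gets instrumented twice.
instrument(shared_act)
instrument(shared_act)

# Tear-down removes one hook and deletes the counter...
uninstrument(shared_act)

# ...but the first hook is still registered, so the next forward pass fails with
# AttributeError: 'GELU' object has no attribute 'flops'
shared_act(torch.randn(4))
```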
I can push a fix for this in a bit.
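For reference only (this is not the actual patch, and instrument/uninstrument/_profile_handles are made-up names), one robust direction is to accumulate every registered handle in a list on the module instead of a single overwritable attribute, so tear-down can remove all hooks even when the same module object is visited more than once:

```python
import torch.nn as nn

def post_hook(module, inputs, output):
    module.flops += 1

def instrument(module: nn.Module):
    # Initialize the counter only once, even if the shared module is visited repeatedly.
    if not hasattr(module, "flops"):
        module.flops = 0
    handles = getattr(module, "_profile_handles", [])
    handles.append(module.register_forward_hook(post_hook))
    module._profile_handles = handles

def uninstrument(module: nn.Module):
    # Remove every hook that was registered, not just the last one.
    for handle in getattr(module, "_profile_handles", []):
        handle.remove()
    module._profile_handles = []
    if hasattr(module, "flops"):
        del module.flops
```

An alternative would be to skip registration entirely when the module already carries a profiler handle, which also avoids counting the shared module's flops more than once per forward call.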