
[BUG] Floating Point Exception (core dump) at launch_attn_softmax_v2<float>

Open codertimo opened this issue 2 years ago • 3 comments

Describe the bug

I tried to run inference on a GPT-2 model with the code below, which uses the DeepSpeed inference optimization. When I ran model inference repeatedly, a floating point exception (core dump) occurred and the program died.

I debugged where the error comes from using gdb, and it shows that the exception occurred at launch_attn_softmax_v2<float>.

Interestingly, a pip install of commit d7684f4 works fine without any exception, so I think the bug was introduced after commit d7684f4 (see the install sketch below).

I attached the log and gdb output; please refer to the attachments below.

Thanks :)

  • Worked on this together with @ckw1140.
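
  • For reference, a commit-pinned install like the one mentioned above can be done with pip's Git URL syntax (an editorial sketch; the exact command is not part of the original report):

    pip install git+https://github.com/microsoft/DeepSpeed.git@d7684f4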

To Reproduce

import deepspeed
import torch
from transformers import GPT2Config, GPT2LMHeadModel

TINY_GPT_CONFIG = {
    "activation_function": "gelu_new",
    "architectures": ["GPT2LMHeadModel"],
    "attn_pdrop": 0.1,
    "bos_token_id": 50256,
    "embd_pdrop": 0.1,
    "eos_token_id": 50256,
    "initializer_range": 0.02,
    "layer_norm_epsilon": 1e-05,
    "model_type": "gpt2",
    "n_ctx": 1024,
    "n_embd": 2,
    "n_head": 2,
    "n_layer": 2,
    "n_positions": 1024,
    "resid_pdrop": 0.1,
    "summary_activation": None,
    "summary_first_dropout": 0.1,
    "summary_proj_to_labels": True,
    "summary_type": "cls_index",
    "summary_use_proj": True,
    "task_specific_params": {"text-generation": {"do_sample": True, "max_length": 50}},
    "vocab_size": 50257,
}


def main():
    device = torch.device("cuda")
    config = GPT2Config.from_dict(TINY_GPT_CONFIG)
    model = GPT2LMHeadModel(config)
    model.to(device)
    model.eval()

    deepspeed_model = deepspeed.init_inference(
        model=model,
        mp_size=1,
        dtype=torch.float32,
        replace_method="auto",
        replace_with_kernel_inject=True,
    )

    for i in range(200):  # per the logs below, the FPE fires on the 128th call (i == 127)
        with torch.inference_mode():
            input_ids = torch.tensor([list(range(242))], dtype=torch.long, device=device)
            context_token_length = input_ids.size(1)

            outputs = deepspeed_model.module.generate(
                input_ids,
                max_length=context_token_length + 16,
                min_length=context_token_length + 16,
                pad_token_id=0,
            )
        print(i, outputs.size())


if __name__ == "__main__":
    main()

Expected behavior

A floating-point exception should not occur.

ds_report output

ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch']
torch version .................... 1.10.0+cu102
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.6.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

Log output

[2022-04-29 15:50:05,083] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.3, git-hash=unknown, git-branch=unknown
[2022-04-29 15:50:05,083] [INFO] [engine.py:197:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/codertimo/.cache/torch_extensions/py37_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/codertimo/.cache/torch_extensions/py37_cu102/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.6923525333404541 seconds
DeepSpeed Transformer Inference config is  {'layer_id': 0, 'hidden_size': 2, 'intermediate_size': 8, 'heads': 2, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1}
DeepSpeed Transformer Inference config is  {'layer_id': 1, 'hidden_size': 2, 'intermediate_size': 8, 'heads': 2, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1}
[2022-04-29 15:50:06,281] [INFO] [engine.py:130:__init__] Place model to device: 0
0 torch.Size([1, 258])
1 torch.Size([1, 258])
2 torch.Size([1, 258])
3 torch.Size([1, 258])
4 torch.Size([1, 258])
5 torch.Size([1, 258])
6 torch.Size([1, 258])
7 torch.Size([1, 258])
8 torch.Size([1, 258])
9 torch.Size([1, 258])
10 torch.Size([1, 258])
11 torch.Size([1, 258])
12 torch.Size([1, 258])
13 torch.Size([1, 258])
14 torch.Size([1, 258])
15 torch.Size([1, 258])
16 torch.Size([1, 258])
17 torch.Size([1, 258])
18 torch.Size([1, 258])
19 torch.Size([1, 258])
20 torch.Size([1, 258])
21 torch.Size([1, 258])
22 torch.Size([1, 258])
23 torch.Size([1, 258])
24 torch.Size([1, 258])
25 torch.Size([1, 258])
26 torch.Size([1, 258])
27 torch.Size([1, 258])
28 torch.Size([1, 258])
29 torch.Size([1, 258])
30 torch.Size([1, 258])
31 torch.Size([1, 258])
32 torch.Size([1, 258])
33 torch.Size([1, 258])
34 torch.Size([1, 258])
35 torch.Size([1, 258])
36 torch.Size([1, 258])
37 torch.Size([1, 258])
38 torch.Size([1, 258])
39 torch.Size([1, 258])
40 torch.Size([1, 258])
41 torch.Size([1, 258])
42 torch.Size([1, 258])
43 torch.Size([1, 258])
44 torch.Size([1, 258])
45 torch.Size([1, 258])
46 torch.Size([1, 258])
47 torch.Size([1, 258])
48 torch.Size([1, 258])
49 torch.Size([1, 258])
50 torch.Size([1, 258])
51 torch.Size([1, 258])
52 torch.Size([1, 258])
53 torch.Size([1, 258])
54 torch.Size([1, 258])
55 torch.Size([1, 258])
56 torch.Size([1, 258])
57 torch.Size([1, 258])
58 torch.Size([1, 258])
59 torch.Size([1, 258])
60 torch.Size([1, 258])
61 torch.Size([1, 258])
62 torch.Size([1, 258])
63 torch.Size([1, 258])
64 torch.Size([1, 258])
65 torch.Size([1, 258])
66 torch.Size([1, 258])
67 torch.Size([1, 258])
68 torch.Size([1, 258])
69 torch.Size([1, 258])
70 torch.Size([1, 258])
71 torch.Size([1, 258])
72 torch.Size([1, 258])
73 torch.Size([1, 258])
74 torch.Size([1, 258])
75 torch.Size([1, 258])
76 torch.Size([1, 258])
77 torch.Size([1, 258])
78 torch.Size([1, 258])
79 torch.Size([1, 258])
80 torch.Size([1, 258])
81 torch.Size([1, 258])
82 torch.Size([1, 258])
83 torch.Size([1, 258])
84 torch.Size([1, 258])
85 torch.Size([1, 258])
86 torch.Size([1, 258])
87 torch.Size([1, 258])
88 torch.Size([1, 258])
89 torch.Size([1, 258])
90 torch.Size([1, 258])
91 torch.Size([1, 258])
92 torch.Size([1, 258])
93 torch.Size([1, 258])
94 torch.Size([1, 258])
95 torch.Size([1, 258])
96 torch.Size([1, 258])
97 torch.Size([1, 258])
98 torch.Size([1, 258])
99 torch.Size([1, 258])
100 torch.Size([1, 258])
101 torch.Size([1, 258])
102 torch.Size([1, 258])
103 torch.Size([1, 258])
104 torch.Size([1, 258])
105 torch.Size([1, 258])
106 torch.Size([1, 258])
107 torch.Size([1, 258])
108 torch.Size([1, 258])
109 torch.Size([1, 258])
110 torch.Size([1, 258])
111 torch.Size([1, 258])
112 torch.Size([1, 258])
113 torch.Size([1, 258])
114 torch.Size([1, 258])
115 torch.Size([1, 258])
116 torch.Size([1, 258])
117 torch.Size([1, 258])
118 torch.Size([1, 258])
119 torch.Size([1, 258])
120 torch.Size([1, 258])
121 torch.Size([1, 258])
122 torch.Size([1, 258])
123 torch.Size([1, 258])
124 torch.Size([1, 258])
125 torch.Size([1, 258])
126 torch.Size([1, 258])
Floating point exception (core dumped)

Launcher context

I launched without the DeepSpeed launcher, just python model.py.

Additional context

GDB output

GNU gdb (Ubuntu 8.1-0ubuntu3.1) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/codertimo/generation-serving-1/env/bin/python model_test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff9205f700 (LWP 40476)]
[New Thread 0x7fff9185e700 (LWP 40477)]
[New Thread 0x7fff8d05d700 (LWP 40478)]
[New Thread 0x7fff8c85c700 (LWP 40479)]
[New Thread 0x7fff8805b700 (LWP 40480)]
[New Thread 0x7fff8585a700 (LWP 40482)]
[New Thread 0x7fff83059700 (LWP 40483)]
[New Thread 0x7fff82858700 (LWP 40484)]
[New Thread 0x7fff7e057700 (LWP 40485)]
[New Thread 0x7fff7b856700 (LWP 40486)]
[New Thread 0x7fff79055700 (LWP 40487)]
[New Thread 0x7fff76854700 (LWP 40488)]
[New Thread 0x7fff76053700 (LWP 40489)]
[New Thread 0x7fff71852700 (LWP 40490)]
[New Thread 0x7fff6f051700 (LWP 40491)]
[New Thread 0x7fff6c850700 (LWP 40492)]
[New Thread 0x7fff6c04f700 (LWP 40493)]
[New Thread 0x7fff6984e700 (LWP 40494)]
[New Thread 0x7fff6704d700 (LWP 40495)]
[New Thread 0x7fff6284c700 (LWP 40496)]
[New Thread 0x7fff6004b700 (LWP 40497)]
[New Thread 0x7fff5d84a700 (LWP 40499)]
[New Thread 0x7fff5b049700 (LWP 40500)]
[New Thread 0x7fff5a848700 (LWP 40501)]
[New Thread 0x7fff56047700 (LWP 40502)]
[New Thread 0x7fff53846700 (LWP 40503)]
[New Thread 0x7fff51045700 (LWP 40504)]
[New Thread 0x7fff4e844700 (LWP 40505)]
[New Thread 0x7fff4c043700 (LWP 40506)]
[New Thread 0x7fff49842700 (LWP 40507)]
[New Thread 0x7fff49041700 (LWP 40508)]
[New Thread 0x7fff44840700 (LWP 40509)]
[New Thread 0x7fff4403f700 (LWP 40510)]
[New Thread 0x7fff3f83e700 (LWP 40511)]
[New Thread 0x7fff3d03d700 (LWP 40512)]
[New Thread 0x7fff3c83c700 (LWP 40513)]
[New Thread 0x7fff3c03b700 (LWP 40514)]
[New Thread 0x7fff3583a700 (LWP 40515)]
[New Thread 0x7fff33039700 (LWP 40516)]
[New Thread 0x7fff284e9700 (LWP 40521)]
[New Thread 0x7fff27ce8700 (LWP 40522)]
[New Thread 0x7fff274e7700 (LWP 40523)]
[New Thread 0x7fff26ce6700 (LWP 40524)]
[New Thread 0x7fff264e5700 (LWP 40525)]
[New Thread 0x7fff25ce4700 (LWP 40526)]
[New Thread 0x7fff254e3700 (LWP 40527)]
[New Thread 0x7fff24ce2700 (LWP 40528)]
[New Thread 0x7fff244e1700 (LWP 40529)]
[New Thread 0x7fff23ce0700 (LWP 40530)]
[New Thread 0x7fff234df700 (LWP 40531)]
[New Thread 0x7fff22cde700 (LWP 40532)]
[New Thread 0x7fff224dd700 (LWP 40533)]
[New Thread 0x7fff21cdc700 (LWP 40534)]
[New Thread 0x7fff214db700 (LWP 40535)]
[New Thread 0x7fff20cda700 (LWP 40536)]
[New Thread 0x7fff204d9700 (LWP 40537)]
[New Thread 0x7fff1fcd8700 (LWP 40538)]
[New Thread 0x7fff1f4d7700 (LWP 40539)]
[New Thread 0x7fff1d4a6700 (LWP 40553)]
[New Thread 0x7fff1cca5700 (LWP 40554)]
[2022-04-29 15:51:22,670] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.3, git-hash=unknown, git-branch=unknown
[2022-04-29 15:51:22,670] [INFO] [engine.py:197:_init_quantization_setting] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[Thread 0x7fff44840700 (LWP 40509) exited]
[Thread 0x7fff33039700 (LWP 40516) exited]
[Thread 0x7fff3583a700 (LWP 40515) exited]
[Thread 0x7fff3c03b700 (LWP 40514) exited]
[Thread 0x7fff3c83c700 (LWP 40513) exited]
[Thread 0x7fff3d03d700 (LWP 40512) exited]
[Thread 0x7fff3f83e700 (LWP 40511) exited]
[Thread 0x7fff4403f700 (LWP 40510) exited]
[Thread 0x7fff49041700 (LWP 40508) exited]
[Thread 0x7fff49842700 (LWP 40507) exited]
[Thread 0x7fff4c043700 (LWP 40506) exited]
[Thread 0x7fff4e844700 (LWP 40505) exited]
[Thread 0x7fff51045700 (LWP 40504) exited]
[Thread 0x7fff53846700 (LWP 40503) exited]
[Thread 0x7fff56047700 (LWP 40502) exited]
[Thread 0x7fff5a848700 (LWP 40501) exited]
[Thread 0x7fff5b049700 (LWP 40500) exited]
[Thread 0x7fff5d84a700 (LWP 40499) exited]
[Thread 0x7fff6004b700 (LWP 40497) exited]
[Thread 0x7fff6284c700 (LWP 40496) exited]
[Thread 0x7fff6704d700 (LWP 40495) exited]
[Thread 0x7fff6984e700 (LWP 40494) exited]
[Thread 0x7fff6c04f700 (LWP 40493) exited]
[Thread 0x7fff6c850700 (LWP 40492) exited]
[Thread 0x7fff6f051700 (LWP 40491) exited]
[Thread 0x7fff71852700 (LWP 40490) exited]
[Thread 0x7fff76053700 (LWP 40489) exited]
[Thread 0x7fff76854700 (LWP 40488) exited]
[Thread 0x7fff79055700 (LWP 40487) exited]
[Thread 0x7fff7b856700 (LWP 40486) exited]
[Thread 0x7fff7e057700 (LWP 40485) exited]
[Thread 0x7fff82858700 (LWP 40484) exited]
[Thread 0x7fff83059700 (LWP 40483) exited]
[Thread 0x7fff8585a700 (LWP 40482) exited]
[Thread 0x7fff8805b700 (LWP 40480) exited]
[Thread 0x7fff8c85c700 (LWP 40479) exited]
[Thread 0x7fff8d05d700 (LWP 40478) exited]
[Thread 0x7fff9185e700 (LWP 40477) exited]
[Thread 0x7fff9205f700 (LWP 40476) exited]
Using /home/codertimo/.cache/torch_extensions/py37_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/codertimo/.cache/torch_extensions/py37_cu102/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.8379650115966797 seconds
DeepSpeed Transformer Inference config is  {'layer_id': 0, 'hidden_size': 2, 'intermediate_size': 8, 'heads': 2, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1}
DeepSpeed Transformer Inference config is  {'layer_id': 1, 'hidden_size': 2, 'intermediate_size': 8, 'heads': 2, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1}
[2022-04-29 15:51:24,013] [INFO] [engine.py:130:__init__] Place model to device: 0
0 torch.Size([1, 258])
1 torch.Size([1, 258])
2 torch.Size([1, 258])
3 torch.Size([1, 258])
4 torch.Size([1, 258])
5 torch.Size([1, 258])
6 torch.Size([1, 258])
7 torch.Size([1, 258])
8 torch.Size([1, 258])
9 torch.Size([1, 258])
10 torch.Size([1, 258])
11 torch.Size([1, 258])
12 torch.Size([1, 258])
13 torch.Size([1, 258])
14 torch.Size([1, 258])
15 torch.Size([1, 258])
16 torch.Size([1, 258])
17 torch.Size([1, 258])
18 torch.Size([1, 258])
19 torch.Size([1, 258])
20 torch.Size([1, 258])
21 torch.Size([1, 258])
22 torch.Size([1, 258])
23 torch.Size([1, 258])
24 torch.Size([1, 258])
25 torch.Size([1, 258])
26 torch.Size([1, 258])
27 torch.Size([1, 258])
28 torch.Size([1, 258])
29 torch.Size([1, 258])
30 torch.Size([1, 258])
31 torch.Size([1, 258])
32 torch.Size([1, 258])
33 torch.Size([1, 258])
34 torch.Size([1, 258])
35 torch.Size([1, 258])
36 torch.Size([1, 258])
37 torch.Size([1, 258])
38 torch.Size([1, 258])
39 torch.Size([1, 258])
40 torch.Size([1, 258])
41 torch.Size([1, 258])
42 torch.Size([1, 258])
43 torch.Size([1, 258])
44 torch.Size([1, 258])
45 torch.Size([1, 258])
46 torch.Size([1, 258])
47 torch.Size([1, 258])
48 torch.Size([1, 258])
49 torch.Size([1, 258])
50 torch.Size([1, 258])
51 torch.Size([1, 258])
52 torch.Size([1, 258])
53 torch.Size([1, 258])
54 torch.Size([1, 258])
55 torch.Size([1, 258])
56 torch.Size([1, 258])
57 torch.Size([1, 258])
58 torch.Size([1, 258])
59 torch.Size([1, 258])
60 torch.Size([1, 258])
61 torch.Size([1, 258])
62 torch.Size([1, 258])
63 torch.Size([1, 258])
64 torch.Size([1, 258])
65 torch.Size([1, 258])
66 torch.Size([1, 258])
67 torch.Size([1, 258])
68 torch.Size([1, 258])
69 torch.Size([1, 258])
70 torch.Size([1, 258])
71 torch.Size([1, 258])
72 torch.Size([1, 258])
73 torch.Size([1, 258])
74 torch.Size([1, 258])
75 torch.Size([1, 258])
76 torch.Size([1, 258])
77 torch.Size([1, 258])
78 torch.Size([1, 258])
79 torch.Size([1, 258])
80 torch.Size([1, 258])
81 torch.Size([1, 258])
82 torch.Size([1, 258])
83 torch.Size([1, 258])
84 torch.Size([1, 258])
85 torch.Size([1, 258])
86 torch.Size([1, 258])
87 torch.Size([1, 258])
88 torch.Size([1, 258])
89 torch.Size([1, 258])
90 torch.Size([1, 258])
91 torch.Size([1, 258])
92 torch.Size([1, 258])
93 torch.Size([1, 258])
94 torch.Size([1, 258])
95 torch.Size([1, 258])
96 torch.Size([1, 258])
97 torch.Size([1, 258])
98 torch.Size([1, 258])
99 torch.Size([1, 258])
100 torch.Size([1, 258])
101 torch.Size([1, 258])
102 torch.Size([1, 258])
103 torch.Size([1, 258])
104 torch.Size([1, 258])
105 torch.Size([1, 258])
106 torch.Size([1, 258])
107 torch.Size([1, 258])
108 torch.Size([1, 258])
109 torch.Size([1, 258])
110 torch.Size([1, 258])
111 torch.Size([1, 258])
112 torch.Size([1, 258])
113 torch.Size([1, 258])
114 torch.Size([1, 258])
115 torch.Size([1, 258])
116 torch.Size([1, 258])
117 torch.Size([1, 258])
118 torch.Size([1, 258])
119 torch.Size([1, 258])
120 torch.Size([1, 258])
121 torch.Size([1, 258])
122 torch.Size([1, 258])
123 torch.Size([1, 258])
124 torch.Size([1, 258])
125 torch.Size([1, 258])
126 torch.Size([1, 258])

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x00007fff91e1fc4a in void launch_attn_softmax_v2<float>(float*, float*, bool, bool, bool, int, int, int, int, int, float, CUstream_st*) ()
   from /home/codertimo/.cache/torch_extensions/py37_cu102/transformer_inference/transformer_inference.so
(gdb) bt
#0  0x00007fff91e1fc4a in void launch_attn_softmax_v2<float>(float*, float*, bool, bool, bool, int, int, int, int, int, float, CUstream_st*) ()
   from /home/codertimo/.cache/torch_extensions/py37_cu102/transformer_inference/transformer_inference.so
#1  0x00007fff91e07233 in ds_softmax<float> (attn_scores=..., attn_mask=..., triangular=triangular@entry=false, recompute=recompute@entry=false, 
    local_attention=local_attention@entry=false, window_size=1, async_op=false)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:35
#2  0x00007fff91e0a104 in attention_unfused<float> (prev_key_cont=..., query_cont=..., attn_mask=..., prev_value_cont=..., output=..., bsz=@0x7fffffffa1ac: 1, 
    seq_len=@0x7fffffffa1b0: 242, soft_len=@0x7fffffffa1b4: 32881, heads=@0x7fffffffa258: 2, norm_factor=@0x7fffffffa19c: 1, triangular=false, recompute=false, 
    local_attention=false, window_size=1)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:139
#3  0x00007fff91e0a485 in ds_softmax_context<float> (query=..., prev_key=..., new_key=..., attn_mask=..., prev_value=..., new_value=..., heads=<optimized out>, 
    norm_factor=<optimized out>, merging=false, triangular=true, local_attention=false, window_size=1, no_masking=false)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:195
#4  0x00007fff91e1ba69 in pybind11::detail::argument_loader<at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool>::call_impl<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul, 11ul, 12ul, pybind11::detail::void_type>(std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul, 11ul, 12ul>, pybind11::detail::void_type&&) && (f=<optimized out>, this=0x7fffffffa2c0)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2042
#5  pybind11::detail::argument_loader<at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool>::call<std::vector<at::Tensor, std::allocator<at::Tensor> >, pybind11::detail::void_type, std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool)>(std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool)) && (f=<optimized out>, this=<optimized out>)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch/include/pybind11/cast.h:2014
#6  void pybind11::cpp_function::initialize<std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool), std::vector<at::Tensor, std::allocator<at::Tensor> >, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [37]>(std::vector<at::Tensor, std::allocator<at::Tensor> > (*&)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool), std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, float, bool, bool, bool, int, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [37])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (__closure=<optimized out>, call=...)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch/include/pybind11/pybind11.h:192
#7  0x00007fff91e17139 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff1c431e10, kwargs_in=0x0)
    at /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch/include/pybind11/pybind11.h:767
#8  0x00000000005d7f64 in _PyMethodDef_RawFastCallKeywords ()
#9  0x0000000000551ab4 in _PyEval_EvalFrameDefault ()
#10 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#11 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#12 0x000000000054dd08 in _PyEval_EvalFrameDefault ()
#13 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#14 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#15 0x000000000054dd08 in _PyEval_EvalFrameDefault ()
#16 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#17 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#18 0x00007ffff25dbaf0 in THPFunction_apply(_object*, _object*) () from /home/codertimo/generation-serving-1/env/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#19 0x00000000005d7e72 in _PyMethodDef_RawFastCallKeywords ()
#20 0x000000000054a9c0 in ?? ()
#21 0x0000000000551c08 in _PyEval_EvalFrameDefault ()
#22 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#23 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#24 0x00000000004d8102 in ?? ()
#25 0x00000000005dbbc6 in PyObject_Call ()
#26 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#27 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#28 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#29 0x000000000059418a in ?? ()
#30 0x00000000005d96db in _PyObject_FastCallKeywords ()
#31 0x000000000054aa51 in ?? ()
#32 0x0000000000551c08 in _PyEval_EvalFrameDefault ()
#33 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#34 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#35 0x00000000004d8102 in ?? ()
#36 0x00000000005dbbc6 in PyObject_Call ()
#37 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#38 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#39 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#40 0x000000000059412b in ?? ()
#41 0x00000000005d96db in _PyObject_FastCallKeywords ()
#42 0x000000000054aa51 in ?? ()
#43 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#44 0x000000000054b302 in _PyEval_EvalCodeWithName ()
---Type <return> to continue, or q <return> to quit---
#45 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#46 0x00000000004d8102 in ?? ()
#47 0x00000000005dbbc6 in PyObject_Call ()
#48 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#49 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#50 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#51 0x000000000059412b in ?? ()
#52 0x00000000005d96db in _PyObject_FastCallKeywords ()
#53 0x000000000054aa51 in ?? ()
#54 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#55 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#56 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#57 0x00000000004d8102 in ?? ()
#58 0x00000000005dbbc6 in PyObject_Call ()
#59 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#60 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#61 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#62 0x000000000059412b in ?? ()
#63 0x00000000005dbbc6 in PyObject_Call ()
#64 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#65 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#66 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#67 0x00000000004d8102 in ?? ()
#68 0x00000000005dbbc6 in PyObject_Call ()
#69 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#70 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#71 0x00000000005d9dbe in _PyFunction_FastCallDict ()
#72 0x000000000054f0e4 in _PyEval_EvalFrameDefault ()
#73 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#74 0x00000000005d8bd2 in _PyFunction_FastCallKeywords ()
#75 0x000000000054a880 in ?? ()
#76 0x000000000054ebbd in _PyEval_EvalFrameDefault ()
#77 0x00000000005d88dc in _PyFunction_FastCallKeywords ()
#78 0x000000000054dd08 in _PyEval_EvalFrameDefault ()
#79 0x000000000054b302 in _PyEval_EvalCodeWithName ()
#80 0x000000000054d803 in PyEval_EvalCode ()
#81 0x00000000006308e2 in ?? ()
#82 0x0000000000630997 in PyRun_FileExFlags ()
#83 0x000000000063160f in PyRun_SimpleFileExFlags ()
#84 0x000000000065450e in ?? ()
#85 0x000000000065486e in _Py_UnixMain ()
#86 0x00007ffff7a05b97 in __libc_start_main (main=0x4b84d0 <main>, argc=2, argv=0x7fffffffdb98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffdb88) at ../csu/libc-start.c:310
#87 0x00000000005df80a in _start ()

codertimo avatar Apr 29 '22 07:04 codertimo

I think commit b4fcd98 introduced the issue; the previous commit (32d979) works fine.

ckw1140 avatar Apr 29 '22 07:04 ckw1140

@jeffra Could you check this issue? I think https://github.com/microsoft/DeepSpeed/pull/1899 introduced it.

codertimo avatar Apr 29 '22 07:04 codertimo

I printed out all the variables in the function and found the main cause of this issue!

1. A division-by-zero error happens at this line; sequence_length is 32880:

https://github.com/microsoft/DeepSpeed/blob/89e37ef360dddf10bed996734784e290b9b5fc62/csrc/transformer/inference/csrc/softmax.cu#L386

2. This happens because layer_past is not freed even after an iteration finishes:

https://github.com/microsoft/DeepSpeed/blob/89e37ef360dddf10bed996734784e290b9b5fc62/deepspeed/ops/transformer/inference/transformer_inference.py#L621

I think layer_past accumulates every time model inference runs!

3. It works when I remove the line above! (A sketch of the accumulation follows the log below.)

And the sequence length stays in the normal range we expected:

sequence_length 240
sequence_length 241
sequence_length 241
sequence_length 242
sequence_length 242
sequence_length 243
sequence_length 243
sequence_length 244
sequence_length 244
sequence_length 245
sequence_length 245
sequence_length 246
sequence_length 246
sequence_length 247
sequence_length 247
sequence_length 248
sequence_length 248
sequence_length 249
sequence_length 249
sequence_length 250
sequence_length 250
sequence_length 251
sequence_length 251
sequence_length 252
sequence_length 252
sequence_length 253
sequence_length 253
sequence_length 254
sequence_length 254
sequence_length 255
sequence_length 255
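
As a back-of-the-envelope check, here is a minimal sketch (editorial, not DeepSpeed code) of the accumulation: each generate() call above processes a 242-token prompt and then 15 single-token steps, so an uncleared cache grows by 257 entries per call, and on the 128th call the softmax length reaches 127 * 257 + 242 = 32881, exactly the soft_len in the gdb backtrace.

past_len = 0
for call in range(128):
    prompt_len, new_tokens = 242, 16
    # what ds_softmax sees on the first step of this call
    soft_len = past_len + prompt_len
    # the first step consumes the prompt, then new_tokens - 1 incremental steps follow
    past_len += prompt_len + (new_tokens - 1)  # +257 per call, never reset
print(soft_len)  # 32881 on the 128th call, matching soft_len=32881 in frame #2

Freeing layer_past between calls keeps the softmax length bounded by the current sequence length, consistent with the normal sequence_length values printed above.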

codertimo avatar May 01 '22 12:05 codertimo

@codertimo, I tried to reproduce this issue on the latest DeepSpeed master, and here are my observations:

  1. Please change n_embd to a value > 8 to work with DeepSpeed.
  2. After changing n_embd to a value > 8, the script you provided passes without any issues (see the snippet below).

Please confirm this on your side.
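
For example, in the repro config above (16 is just one illustrative value satisfying the > 8 observation; the original repro used 2):

TINY_GPT_CONFIG["n_embd"] = 16  # any value > 8 per the observation above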

lokoppakmsft avatar Dec 09 '22 18:12 lokoppakmsft

I experienced this issue as well with tag 0.7.7 when trying to use DeepSpeed inference for GPT-J. It occurred for both float16 and float32. I am about to test master now.

mallorbc avatar Jan 11 '23 21:01 mallorbc

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Excerpt from a method; self.config comes from the surrounding class.
model = AutoModelForCausalLM.from_pretrained(self.config.model, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(self.config.model)
local_rank = 0
world_size = 1
generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                     device=local_rank, torch_dtype=torch.float16)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           max_tokens=self.config.max_tokens,
                                           replace_with_kernel_inject=True)

mallorbc avatar Jan 11 '23 21:01 mallorbc

Can confirm I have the same issue with the master branch

mallorbc avatar Jan 11 '23 22:01 mallorbc

@lokoppakmsft it may be a good idea to reopen this issue.

mallorbc avatar Jan 11 '23 22:01 mallorbc