DeepSpeed
[BUG] Using 8x 32GB V100 GPUs with use_meta_tensor to run inference on a large model: Cannot copy out of meta tensor; no data!
My DeepSpeed version is 0.8.1, my torch version is 1.13.1, and my transformers version is 4.21.2. My CPU memory is 500GB.
I followed the documentation to run my code.
- Below is my script:
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype int8 --use_meta_tensor
and
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype float16 --use_meta_tensor
My error is:
File "inference-test.py", line 111, in <module>
outputs = pipe(inputs,
File "/home/YYYYY/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 71, in __call__
outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
File "/home/YYYYY/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 115, in generate_outputs
self.model.cuda().to(self.device)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 749, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 749, in <lambda>
return self._apply(lambda t: t.cuda(device))
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-19 06:47:26,453] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12532
[2023-02-19 06:47:26,672] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12587
[2023-02-19 06:47:26,891] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12615
- Then I tried another script:
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype int8
and
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype float16
and my error is below:
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1358954496 bytes. Error code 12 (Cannot allocate memory)
- I also tried the facebook/opt-30b model and got the same error as above.
The batch size is 1 in all cases.
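For context, the meta-tensor path is expected to build the model skeleton on the meta device and let DeepSpeed materialize the sharded weights from the checkpoint files; the traceback above shows `.cuda()` being called on parameters that are still meta tensors, which is exactly what raises `Cannot copy out of meta tensor; no data!`. Below is a minimal sketch of that path, assuming an OPT checkpoint and a checkpoint index file; the `checkpoints.json` path and the exact call sequence are illustrative, not the literal inference-test.py code.

```python
# Minimal sketch (assumed, not the exact inference-test.py code) of the
# --use_meta_tensor load path: build the skeleton on the meta device, then
# let DeepSpeed materialize and shard the real weights.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-66b")

# No weight storage is allocated here; every parameter lives on the meta device.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# init_inference loads and shards the real weights across the 8 GPUs.
# Only after this call do the parameters hold data; "checkpoints.json" is an
# assumed path to the checkpoint index that the meta-tensor path requires.
model = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",
)

# Calling model.cuda() while the parameters are still meta tensors (as
# utils.py's generate_outputs does in the traceback above) raises
# "NotImplementedError: Cannot copy out of meta tensor; no data!".
```

If the parameters still report device `meta` at the point of the `.cuda()` call, the weights were never materialized, which matches the error above.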
deepspeed 0.7.6 with transformers 4.25.1 works for me.
Thanks. By the way, do you use ZeRO to run DeepSpeed inference?
@lambda7xx, please see the example here.
@lambda7xx, please see the example bloom-ds-zero-inference.py.
I used this code to run inference on a BLOOM model (176B) on 8 V100-32GB GPUs. The end-to-end time is 2000s, which I think is too long. When I use inference-test.py to run inference on the 176B model with int8, the end-to-end time is just 10s.
I don't know why.
The reason is that inference-test.py with int8 is designed for latency-critical scenarios, while ZeRO-Inference targets throughput-critical scenarios. ZeRO-Inference latency is higher because the weights are offloaded to disk, while inference-test.py keeps the weights in GPU memory.
Please see the following links for more details:
- https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts
- https://www.deepspeed.ai/2022/09/09/zero-inference.html
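To make that trade-off concrete, here is a rough sketch of the ZeRO-Inference side, assuming the general pattern used by the BLOOM ZeRO-Inference script rather than its exact code: a ZeRO stage-3 config with parameters offloaded off-GPU, so weights are streamed onto the GPU during generation. The model name and config values below are illustrative.

```python
# Rough sketch of ZeRO-Inference (assumed pattern, not the exact
# bloom-ds-zero-inference.py code): stage-3 partitioning with parameters
# offloaded to CPU, gathered on demand during the forward pass.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so the weights are
# loaded directly into ZeRO-3 partitions instead of a full copy per rank.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b",
                                             torch_dtype=torch.float16)

# No optimizer: deepspeed.initialize is used here only to wrap the model
# with the ZeRO-3 engine for inference.
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()
# engine.module.generate(...) now runs with parameters streamed in on demand,
# which maximizes the model size that fits but adds per-token latency.
```

inference-test.py, by contrast, uses deepspeed.init_inference with kernel injection and keeps the fp16/int8 weights resident in GPU memory, which is why its per-token latency is far lower.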