DeepSpeed
[BUG] Using 8x 32GB V100 GPUs with use_meta_tensor to run inference on a large model: Cannot copy out of meta tensor; no data!
My DeepSpeed version is 0.8.1, my torch version is 1.13.1, and my transformers version is 4.21.2. My CPU memory is 500GB.
I followed the documentation to run my code.
- Below is my script:
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype int8 --use_meta_tensor
and
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype float16 --use_meta_tensor
My error is:
File "inference-test.py", line 111, in <module>
outputs = pipe(inputs,
File "/home/YYYYY/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 71, in __call__
outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
File "/home/YYYYY/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 115, in generate_outputs
self.model.cuda().to(self.device)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 749, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/home/YYYYY/DeepSpeedExamples/lib/python3.8/site-packages/torch/nn/modules/module.py", line 749, in <lambda>
return self._apply(lambda t: t.cuda(device))
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-19 06:47:26,453] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12532
[2023-02-19 06:47:26,672] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12587
[2023-02-19 06:47:26,891] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 12615
- Then I tried another script:
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype int8
and
deepspeed --num_gpus 8 inference-test.py --name facebook/opt-66b --batch_size ${BS} --test_performance --dtype float16
and my error is below:
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1358954496 bytes. Error code 12 (Cannot allocate memory)
- I also tried the facebook/opt-30b model and got the same error as above.
The batch size is 1 in all cases.
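For context, the meta-tensor path is expected to build the model skeleton on the meta device and let DeepSpeed materialize the sharded weights from the checkpoint files; the traceback above shows `.cuda()` being called on parameters that are still meta tensors, which is exactly what raises `Cannot copy out of meta tensor; no data!`. Below is a minimal sketch of that path, assuming an OPT checkpoint and a checkpoint index file; the `checkpoints.json` path and the exact call sequence are illustrative, not the literal inference-test.py code.

```python
# Minimal sketch (assumed, not the exact inference-test.py code) of the
# --use_meta_tensor load path: build the skeleton on the meta device, then
# let DeepSpeed materialize and shard the real weights.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-66b")

# No weight storage is allocated here; every parameter lives on the meta device.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# init_inference loads and shards the real weights across the 8 GPUs.
# Only after this call do the parameters hold data; "checkpoints.json" is an
# assumed path to the checkpoint index that the meta-tensor path requires.
model = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",
)

# Calling model.cuda() while the parameters are still meta tensors (as
# utils.py's generate_outputs does in the traceback above) raises
# "NotImplementedError: Cannot copy out of meta tensor; no data!".
```

If the parameters still report device `meta` at the point of the `.cuda()` call, the weights were never materialized, which matches the error above.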
deepspeed 0.7.6 with transformers 4.25.1 works for me.
Thanks. By the way, do you use ZeRO to run DeepSpeed inference?
@lambda7xx, please see the example here.
@lambda7xx, please see the example bloom-ds-zero-inference.py.
I used this code to run inference on a BLOOM model (176B) on 8 V100-32GB GPUs. The end-to-end time is 2000s, which I think is too long. When I use inference-test.py to run inference on the 176B model with int8, the end-to-end time is just 10s.
I don't know why.
The reason is that inference-test.py with int8 is designed for latency-critical scenarios, while ZeRO-Inference targets throughput-critical scenarios. ZeRO-Inference latency is higher because the weights are offloaded to disk, while inference-test.py keeps the weights in GPU memory.
Please see the following links for more details:
- https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts
- https://www.deepspeed.ai/2022/09/09/zero-inference.html
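To make that trade-off concrete, here is a rough sketch of the ZeRO-Inference side, assuming the general pattern used by the BLOOM ZeRO-Inference script rather than its exact code: a ZeRO stage-3 config with parameters offloaded off-GPU, so weights are streamed onto the GPU during generation. The model name and config values below are illustrative.

```python
# Rough sketch of ZeRO-Inference (assumed pattern, not the exact
# bloom-ds-zero-inference.py code): stage-3 partitioning with parameters
# offloaded to CPU, gathered on demand during the forward pass.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so the weights are
# loaded directly into ZeRO-3 partitions instead of a full copy per rank.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b",
                                             torch_dtype=torch.float16)

# No optimizer: deepspeed.initialize is used here only to wrap the model
# with the ZeRO-3 engine for inference.
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()
# engine.module.generate(...) now runs with parameters streamed in on demand,
# which maximizes the model size that fits but adds per-token latency.
```

inference-test.py, by contrast, uses deepspeed.init_inference with kernel injection and keeps the fp16/int8 weights resident in GPU memory, which is why its per-token latency is far lower.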