transformers-bloom-inference
[Bug] Int8 quantize inference failed using bloom-inference-scripts/bloom-ds-inference.py with deepspeed==0.9.0 on multi-gpus
I am using multiple GPUs to quantize the model and run inference with deepspeed==0.9.0, but it fails.
Device: server with 8x RTX 3090
Docker: nvidia-pytorch-container, tag 22.07-py3; this codebase was then git-cloned inside the container.
Command:
deepspeed --include localhost:1,6 bloom-inference-scripts/bloom-ds-inference.py --local_rank=0 --name bigscience/bloomz-7b1-mt --dtype int8
ErrorLog:
Traceback (most recent call last):
File "bloom-inference-scripts/bloom-ds-inference.py", line 182, in <module>
model = deepspeed.init_inference(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 324, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 194, in __init__
self._apply_injection_policy(config)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 396, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 519, in replace_transformer_layer
load_model_with_checkpoint(replaced_module,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 243, in load_model_with_checkpoint
load_module_recursive(r_module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 237, in load_module_recursive
load_module_recursive(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 237, in load_module_recursive
load_module_recursive(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 235, in load_module_recursive
layer_policies[child.__class__](child, prefix + name + '.')
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 173, in load_transformer_layer
container.load_params(module, sd[0], weight_quantizer, mp_replace, prefix)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/bloom.py", line 51, in load_params
maybe_copy(module.attention,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/policy.py", line 181, in maybe_copy
dst = mp_replace.copy(dst, weight_quantizer.quantize(tmp if weight_quantizer.q_int8 else \
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 111, in copy
dst.data.copy_(src[:, self.gpu_index * dst_shape[self.out_dim]: (self.gpu_index + 1) * dst_shape[self.out_dim]] if outer_dim == 1 else \
RuntimeError: The size of tensor a (6144) must match the size of tensor b (4096) at non-singleton dimension 1
No code was changed, so I wonder why multi-GPU int8 fails while the same multi-GPU setup with FP16 works fine.
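For what it's worth, the numbers in the error line up with the tensor-parallel slicing in replace_module.py: bloomz-7b1 has hidden size 4096, the fused QKV output dimension is 3 * 4096 = 12288, and splitting that over the 2 ranks gives 6144, which is exactly "tensor a" in the message. Below is a minimal sketch (not DeepSpeed's actual code) reproducing the arithmetic; the transposed destination layout for the int8 path is an assumption made purely for illustration:

```python
import numpy as np

hidden = 4096              # bloomz-7b1 hidden size
tp = 2                     # two ranks: --include localhost:1,6
shard = 3 * hidden // tp   # 6144: per-rank slice of the fused QKV output dim

# Assumed layouts (illustrative only): the int8 destination weight is
# stored transposed, so its sharded "output" dimension is dim 1.
src = np.zeros((3 * hidden, hidden))  # checkpoint weight, 12288 x 4096
dst = np.zeros((hidden, shard))       # quantized destination, 4096 x 6144

rank = 0
# Slicing the checkpoint along dim 1 (width 4096) by the 6144-wide shard
# cannot produce a 6144-wide piece, so a copy_ into dst would fail at dim 1.
piece = src[:, rank * shard:(rank + 1) * shard]
print(dst.shape, piece.shape)  # (4096, 6144) vs (12288, 4096)
```

This only reproduces the shape mismatch, not its root cause; it suggests the int8 path is disagreeing with the FP16 path about which dimension of the weight is sharded.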
Same error here. How can it be solved?