
[BUG] Tested the 2662 PR, it fails for GPTJ 6B and a few others

Open sindhuvahinis opened this issue 2 years ago • 8 comments

Describe the bug
We tested PR https://github.com/microsoft/DeepSpeed/pull/2662 against a few models:

  • OPT 1.3B, tp degree 2, fp16
  • OPT 13B, tp degree 4, [fp16, int8]
  • OPT 30B, tp degree 8, [fp16, int8]
  • GPT NeoX 20B, [fp16, int8]
  • GPTJ 6B

Our test involves two steps:

  • Load the model onto the device, generate the partitions, and save them to a local directory.
    • While generating the partitions, the model can be loaded with or without meta tensors (this affects the partition generation).
    • Meta tensors have the same dimensions as the real tensors but contain no data (a small illustration follows this list).
  • Load the generated partition files back to run inference.
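To make the meta-tensor behavior concrete, here is a small standalone illustration in plain PyTorch (independent of DeepSpeed; the tensor shape is arbitrary):

```python
import torch

# A meta tensor records shape and dtype but allocates no storage.
t = torch.empty(1024, 1024, device="meta")
print(t.shape, t.dtype)  # torch.Size([1024, 1024]) torch.float32

# There is no data to copy out, so materializing it on a real device fails,
# with the same error seen for GPTJ 6B in the first table below.
try:
    t.to("cpu")
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!
```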

We used your test suite to test these models.
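For context, a minimal Python sketch of the two steps as we run them is below. It assumes the `mp_size`, `save_mp_checkpoint_path`, `checkpoint`, and `deepspeed.OnDevice` arguments of the DeepSpeed version under test; the model name, tp degree, paths, and the `ds_inference_config.json` filename are illustrative, not prescriptive:

```python
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "facebook/opt-1.3b"  # illustrative; any model from the list above
tp_size = 2
save_dir = "/tmp/ws/sharded-opt-1.3b/"

# Step 1: load the full model with HF and write DS-sharded (presharded) checkpoints.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
deepspeed.init_inference(
    model,
    mp_size=tp_size,                   # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path=save_dir,  # writes partition files plus a checkpoint JSON
)

# Step 2: rebuild the model without weight data (meta tensors) and load the
# presharded partitions back for inference.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))
engine = deepspeed.init_inference(
    model,
    mp_size=tp_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint=f"{save_dir}ds_inference_config.json",  # written in step 1
)
```

The second step corresponds to the `--use_meta_tensor` path of the `inference-test.py` commands quoted later in this thread.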

System info (please complete the following information):

  • 1 machine with 8 GPUs: NVIDIA A10G, 24 GB memory per GPU
  • Ubuntu

The tables below show the results we got from testing; full tracebacks are reproduced after each table.

Load fully in CPU with HF, save to DS sharded, and load back

| Model | Partitions | Dtype | Result: generate DS presharded checkpoints | Result: load presharded checkpoints back and run inference |
| --- | --- | --- | --- | --- |
| OPT 1.3B | 2 | fp16 | Successfully generates presharded checkpoint files | Successfully loads the presharded checkpoints back, runs inference, and generates outputs |
| GPTJ 6B | 4 | fp16 | Successfully generates presharded checkpoint files | Loading them back fails with `NotImplementedError: Cannot copy out of meta tensor; no data!` (traceback A below) |
| GPT NeoX 20B | 8 | fp16 | Successfully generates presharded checkpoint files | Successfully loads the presharded checkpoints back, runs inference, and generates outputs |
| OPT 13B | 4 | int8 | Successfully generates presharded checkpoint files | Loads the presharded checkpoints, but hits a CUDA illegal-memory-access error while generating outputs (traceback B below) |
| OPT 30B | 8 | int8 | Successfully generates presharded checkpoint files | Same CUDA illegal-memory-access error while generating outputs (traceback identical to B) |
| GPT NeoX 20B | 8 | int8 | Successfully generates presharded checkpoint files | Loading them back fails with the INT8 checkpoint-merging assertion (traceback C below) |

Traceback A (GPTJ 6B, fp16, loading the presharded checkpoints back):

```
Traceback (most recent call last):
  File "inference-test.py", line 57, in <module>
    pipe.model = deepspeed.init_inference(pipe.model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 129, in __init__
    self.module.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

Traceback B (OPT 13B, int8, generating outputs after loading the presharded checkpoints):

```
Traceback (most recent call last):
  File "inference-test.py", line 88, in <module>
    outputs = pipe(inputs,
  File "/tmp/ws/models/utils.py", line 69, in __call__
    outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
  File "/tmp/ws/models/utils.py", line 113, in generate_outputs
    outputs = self.model.generate(**input_tokens, **generate_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 537, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1422, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 2049, in sample
    next_token_scores = logits_warper(input_ids, next_token_scores)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_logits_process.py", line 233, in __call__
    indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Traceback C (GPT NeoX 20B, int8, loading the presharded checkpoints back):

```
Traceback (most recent call last):
  File "inference-test.py", line 57, in <module>
    pipe.model = deepspeed.init_inference(pipe.model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 126, in __init__
    self._apply_injection_policy(config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_module.py", line 850, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 252, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 244, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 176, in load_transformer_layer
    load_parameters(child, prefix + n + '.')
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 85, in load_parameters
    assert tmp_data.dtype != torch.int8, \
AssertionError: Merging of the checkpoints are not supported when using INT8 checkpoint! Please use a as many GPUs as TP-size for the checkpoint
```
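For readers who hit the same assertion: it is raised by DeepSpeed's checkpoint loader when it decides it would have to merge checkpoint shards, which is rejected for quantized weights. A paraphrased Python sketch follows; the merge condition and function signature are our assumptions, and only the assertion message is verbatim from traceback C. Note that in our runs the number of GPUs did match the number of partitions (8), which is why the assertion firing looks like a bug.

```python
import torch

def check_int8_merge(tmp_data: torch.Tensor, num_shards: int, tp_size: int) -> None:
    # Paraphrased from deepspeed/module_inject/load_checkpoint.py (load_parameters):
    # if the number of checkpoint shards does not match the TP degree, shards
    # would have to be merged, and merging is rejected for quantized INT8 weights.
    if num_shards != tp_size:
        assert tmp_data.dtype != torch.int8, (
            "Merging of the checkpoints are not supported when using INT8 "
            "checkpoint! Please use a as many GPUs as TP-size for the checkpoint")
```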

Load with meta tensor, save to DS sharded, and load back

| Model | Partitions | Dtype | Result: generate DS presharded checkpoints | Result: load presharded checkpoints back and run inference |
| --- | --- | --- | --- | --- |
| OPT 1.3B | 2 | fp16 | Could not even generate presharded checkpoints; the init_inference call fails with `KeyError: 'decoder.embed_tokens.weight'` (traceback D below) | - |
| OPT 13B | 4 | fp16 | Successfully generates presharded checkpoint files | Successfully loads the presharded checkpoints back, runs inference, and generates outputs |
| OPT 30B | 8 | fp16 | Successfully generates presharded checkpoint files | Successfully loads the presharded checkpoints back, runs inference, and generates outputs |
| GPTJ 6B | 4 | fp16 | Successfully generates presharded checkpoint files | Loading them back fails with the same `KeyError: 'decoder.embed_tokens.weight'` (traceback identical to D) |
| GPT NeoX 20B | 8 | fp16 | Could not even generate presharded checkpoints; the init_inference call fails inside mp_replace.copy (traceback E below) | - |
| OPT 13B | 4 | int8 | Successfully generates presharded checkpoint files | Fails with an NCCL error during all_reduce while generating outputs (traceback F below) |
| OPT 30B | 8 | int8 | Successfully generates presharded checkpoint files | Fails with a CUDA illegal-memory-access error during all_reduce while generating outputs (traceback G below) |
| GPT NeoX 20B | 8 | int8 | Successfully generates checkpoint files | Loading the presharded checkpoints back fails with the INT8 checkpoint-merging assertion (traceback identical to C above) |

Traceback D (OPT 1.3B, fp16, during the init_inference call):

```
Traceback (most recent call last):
  File "inference-test.py", line 57, in <module>
    pipe.model = deepspeed.init_inference(pipe.model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 126, in __init__
    self._apply_injection_policy(config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_module.py", line 820, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 252, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 244, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 30, in load
    module.weight = mp_replace.copy(module.weight.data, sd[0][prefix + 'weight'])
KeyError: 'decoder.embed_tokens.weight'
```

Traceback E (GPT NeoX 20B, fp16, during the init_inference call; the end of this traceback was truncated in the original report):

```
Traceback (most recent call last):
  File "inference-test.py", line 57, in <module>
    pipe.model = deepspeed.init_inference(pipe.model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 126, in __init__
    self._apply_injection_policy(config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_module.py", line 820, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 252, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 244, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 178, in load_transformer_layer
    replace_policy.load_params(module,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_policy.py", line 864, in load_params
    maybe_copy(module.attention,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_policy.py", line 250, in maybe_copy
    dst = mp_replace.copy(dst, weight_quantizer.quantize(tmp if weight_quantizer.q_int8 else \
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/replace_module.py", line 116, in copy
    dst = dst.reshape(-1).data.copy_(weight_split.reshape(-1)).reshape(
```

Traceback F (OPT 13B, int8, generating outputs after loading the presharded checkpoints):

```
Traceback (most recent call last):
  File "inference-test.py", line 88, in <module>
    outputs = pipe(inputs,
  File "/tmp/ws/models/utils.py", line 69, in __call__
    outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
  File "/tmp/ws/models/utils.py", line 113, in generate_outputs
    outputs = self.model.generate(**input_tokens, **generate_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 537, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1422, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 2035, in sample
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/opt/modeling_opt.py", line 935, in forward
    outputs = self.model.decoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/opt/modeling_opt.py", line 699, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 153, in forward
    self.attention(input,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 147, in forward
    dist.all_reduce(output, group=self.mp_group)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 535, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 45, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
```

Traceback G (OPT 30B, int8, generating outputs) has the same stack as traceback F but ends in:

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
```

A quick summary in words:

  • For OPT 1.3B and GPT NeoX 20B with dtype fp16, we could not even generate the partition files when the model is loaded with meta tensors.
  • For GPTJ 6B with dtype fp16, we can generate the presharded checkpoint files with or without meta tensors, but loading them back produces an error.
  • For dtype int8:
    • For OPT 13B and OPT 30B, we could generate the presharded checkpoint files when the model is loaded without meta tensors, i.e., the traditional way, but generating outputs after loading the model back throws an error.
    • For GPT NeoX 20B, we could generate the presharded checkpoint files both with and without meta tensors, but loading them back throws an error.

The major problems here are the following:

  • The generated GPTJ presharded checkpoint files do not work.
  • Presharded checkpoint generation with int8 quantization does not work either.

sindhuvahinis · Jan 31, 2023

@RezaYazdaniAminabadi @lekurile we ran some experiments based on your PR. It works for a few cases but fails in some corner cases. I would suggest we merge the PR, given that some use cases work, and fix the remaining ones in follow-up PRs :)

The BLOOM series was not covered since it is more or less "known to work".

lanking520 · Jan 31, 2023

Hello @sindhuvahinis @lanking520, thank you for reporting this! With the merge of https://github.com/microsoft/DeepSpeed/pull/2725, the major part of this issue should be resolved. I tested the models you listed on the master branch of DeepSpeed with meta tensor and int8 checkpoint loading, and they run smoothly. GPTJ 6B gives a different result; I believe that is a separate issue, and I am actively investigating it. Could you do a quick check on your side to see whether you still hit this issue with the current master branch of DeepSpeed? Thank you!

HeyangQin · Feb 21, 2023

@HeyangQin we ran some tests on #2725 as well and are still observing the major issues with INT8. Will share more details and setup.

lanking520 · Feb 21, 2023

> @HeyangQin we ran some tests on #2725 as well and are still observing the major issues with INT8. Will share more details and setup.

@lanking520 Thank you for the update! If possible, could you share the command line you use to reproduce this issue?

HeyangQin · Feb 21, 2023

Thanks for the update @HeyangQin. As Qing said, we also tested #2725; you can check the comments on that PR for the errors we faced. We tested on multiple GPU counts with tp_size greater than 1.

sindhuvahinis · Feb 21, 2023

@HeyangQin We used the same test suite as yours. For example, for GPT-NeoX, we first generated the checkpoints using save_mp_checkpoint_path and then loaded them back using meta tensors and the checkpoint file:

```bash
# Step 1: generate presharded INT8 checkpoints via --save_mp_checkpoint_path
deepspeed --num_nodes 1 \
    --num_gpus 8 \
    inference-test.py \
    --use_kernel \
    --ds_inference \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b \
    --checkpoint_path /tmp/ws/gpt-neox-20b/ \
    --save_mp_checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8
```

```bash
# Step 2: load the presharded checkpoints back for inference
deepspeed --num_nodes 1 \
    --num_gpus 8 \
    inference-test.py \
    --use_kernel \
    --ds_inference \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b \
    --checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8
```

sindhuvahinis · Feb 21, 2023

A similar test can be run quickly on OPT/GPTJ/GPT-NeoX/BLOOM 7B with INT8; these models all produce garbage outputs.

  • The OPT models hit an NCCL communication issue
  • GPT-NeoX 20B produces garbage output
  • BLOOM-7B: shape '[1, 4, 32, 384]' is invalid for input of size 16384

Just tried these models on DeepSpeed 0.8.1.

lanking520 · Feb 21, 2023

Maybe we could close this issue, since meta tensor and checkpoint loading mostly works for the other precision types (FP16/FP32), and open a new one specifically for INT8. @HeyangQin what do you think?

lanking520 · Feb 22, 2023

Hi @lanking520 @sindhuvahinis, thank you for the information. Previously I had only tested checkpoint loading with int8. Now that I have tested checkpoint saving with int8, I see the same errors you reported. After some initial investigation, I think there are multiple underlying issues causing these errors:

  1. DeepSpeedExamples tries to load checkpoints even when they don't exist. I fixed this in https://github.com/microsoft/DeepSpeedExamples/commit/efacebb3ddbea86bb20c3af30fd060be0fa41ac8 (a hypothetical sketch of this kind of guard follows this list).
  2. load_params() should reside in the policy. I fixed this in https://github.com/microsoft/DeepSpeed/pull/2875 and will merge it once it is reviewed.
  3. Kernel issues. I am working on these.
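To illustrate the first fix, here is a hypothetical sketch of the kind of guard involved; the function name and logic are illustrative, not the actual diff from the commit:

```python
import os

def resolve_checkpoint_dir(checkpoint_path):
    """Return checkpoint_path only if it actually contains checkpoint files."""
    # Hypothetical guard: fall back to loading the original HF weights when no
    # presharded checkpoint files exist, instead of failing on a missing path.
    if checkpoint_path and os.path.isdir(checkpoint_path) and os.listdir(checkpoint_path):
        return checkpoint_path
    return None
```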

Once int8 checkpoint saving works, I plan to add unit tests to prevent such errors in the future. I agree with @lanking520 that opening a new issue would be more organized, since this issue is about a PR that has already been merged.

HeyangQin · Feb 22, 2023

Sounds good. @sindhuvahinis let's close this issue and open a new one titled:

[0.8.1] INT8 model loading/inference issue

and summarize the findings there.

lanking520 · Feb 22, 2023