How to convert model trained with QuantNoise to quantized model runnable on CPU only?
What is your question?
I successfully trained a translation model with scalar quantization, and I am wondering how to convert the fake INT8 weights into real INT8 weights. When I try to use the model on CPU (without CUDA), I get CUDA errors. Any insight is greatly appreciated.
Code
Generation
#!/bin/bash
checkpoint=$1
fairseq-generate binarized --gen-subset test \
--source-lang src --target-lang tgt --cpu \
--path $checkpoint --beam 1 --nbest 1
This results in the following exception:
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 434, in inference_step
models, sample, prefix_tokens=prefix_tokens, constraints=constraints
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 177, in generate
return self._generate(sample, **kwargs)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 237, in _generate
encoder_outs = self.model.forward_encoder(net_input)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 806, in forward_encoder
return [model.encoder.forward_torchscript(net_input) for model in self.models]
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 806, in <listcomp>
return [model.encoder.forward_torchscript(net_input) for model in self.models]
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/models/fairseq_encoder.py", line 55, in forward_torchscript
return self.forward_non_torchscript(net_input)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
return self.forward(**encoder_input)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/models/transformer.py", line 413, in forward
x, encoder_embedding = self.forward_embedding(src_tokens, token_embeddings)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/models/transformer.py", line 372, in forward_embedding
token_embedding = self.embed_tokens(src_tokens)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/modules/quantization/scalar/modules/qemb.py", line 106, in forward
zero_point=self.zero_point,
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/modules/quantization/scalar/ops.py", line 11, in emulate_int
return q(w, scale=scale, zero_point=zero_point)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/modules/quantization/scalar/ops.py", line 25, in emulate_int8_histogram
scale = scale.cuda().type_as(w)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
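The failure comes from the hard-coded scale.cuda() call in fairseq/modules/quantization/scalar/ops.py. To illustrate what a device-agnostic version of that fake-quantization step would look like (a sketch only, not the actual fairseq implementation), the scale and zero point would follow the weight's device instead of being forced onto CUDA:

import torch

def fake_quant_int8(w: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor) -> torch.Tensor:
    """Illustrative fake-INT8 step that stays on the weight's device.

    Sketch only: it mirrors the idea of fairseq's emulate_int8_* helpers but
    avoids the hard-coded .cuda() call that breaks CPU-only inference.
    """
    scale = scale.to(w.device).type_as(w)            # instead of scale.cuda().type_as(w)
    zero_point = zero_point.to(w.device).type_as(w)
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 255)  # snap to the INT8 grid
    return (q - zero_point) * scale                  # dequantize back to float ("fake" INT8)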
What have you tried?
I then tried: 1) Forcing the model into inference mode and saving it:
from dataclasses import dataclass
import torch
from fairseq import quantization_utils  # location may differ across fairseq versions

@dataclass
class NoiseConfig:
    quant_noise_scalar: float = 1.0

# `model` is the translation model already built and loaded from the checkpoint (not shown here)
checkpoint = torch.load(args.checkpoint, map_location=torch.device('cpu'))
model = quantization_utils.quantize_model_scalar(model, NoiseConfig())
model.eval()
# is there a simpler way?
checkpoint["model"] = model.state_dict()
torch.save(checkpoint, args.out)
This reduced the model size on disk, but otherwise nothing changed. 2) Quantizing the model with torch directly:
model_int8 = torch.quantization.quantize_dynamic(
    model.eval(),        # the original model
    {torch.nn.Linear},   # a set of layers to dynamically quantize
    dtype=torch.qint8,
)
Running inference then failed with unexpected keys in the state dict:
Traceback (most recent call last):
File "/home/memsource/miniconda3/envs/robo/bin/fairseq-generate", line 8, in <module>
sys.exit(cli_main())
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq_cli/generate.py", line 379, in cli_main
main(args)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq_cli/generate.py", line 41, in main
return _main(args, sys.stdout)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq_cli/generate.py", line 94, in _main
num_shards=args.checkpoint_shard_count,
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 256, in load_model_ensemble
num_shards,
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 287, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict, args=args)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/fairseq/models/fairseq_model.py", line 99, in load_state_dict
return super().load_state_dict(new_state_dict, strict)
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
Unexpected key(s) in state_dict: "decoder.output_projection.scale", "decoder.output_projection.zero_point", "decoder.output_projection._packed_params.dtype", "decoder.output_projection._packed_params._packed_params".
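The unexpected keys are a side effect of dynamic quantization: each nn.Linear is replaced by a quantized module whose state dict stores _packed_params, which fairseq's checkpoint loader does not recognize. One workaround sketch (untested; "checkpoint.pt" is a placeholder path) is to quantize in memory after fairseq has loaded the float checkpoint, instead of saving and reloading the quantized state dict:

import torch
from fairseq import checkpoint_utils

# Load the float checkpoint through fairseq as usual ...
models, args, task = checkpoint_utils.load_model_ensemble_and_task(["checkpoint.pt"])

# ... then dynamically quantize the loaded models in memory, avoiding the
# state_dict round-trip that introduces the "_packed_params" keys.
models = [
    torch.quantization.quantize_dynamic(m.eval(), {torch.nn.Linear}, dtype=torch.qint8)
    for m in models
]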
3) Torch FX quantization:
from torch.quantization import quantize_fx

qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}
# prepare
model_prepared = quantize_fx.prepare_fx(model, qconfig_dict)
# no calibration needed when we only have dynamic/weight-only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)
This raised the following exception:
File "/home/memsource/miniconda3/envs/robo/lib/python3.7/site-packages/torch/fx/_symbolic_trace.py", line 381, in path_of_module
raise NameError('module is not installed as a submodule')
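This NameError typically means that FX symbolic tracing reached a module call for a module that is not registered as a submodule of the root model (for example, a layer kept in a plain Python list), so the tracer cannot build a qualified name for it. A minimal toy reproduction of the same failure mode (hypothetical model, not the fairseq transformer):

import torch
from torch import nn
from torch.quantization import quantize_fx

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list hides the Linear from named_modules(),
        # so FX tracing cannot resolve a path for it when it is called.
        self.layers = [nn.Linear(4, 4)]

    def forward(self, x):
        return self.layers[0](x)

m = Toy().eval()
qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}
try:
    quantize_fx.prepare_fx(m, qconfig_dict)
except NameError as e:
    print(e)  # NameError: module is not installed as a submodule ...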
What's your environment?
- fairseq version: 0.10.2
- PyTorch version: 1.10.0
- OS: Ubuntu 20.04
- How you installed fairseq: pip
- Python version: 3.7
- CUDA/cuDNN version: N/A
- Model architecture: transformer_wmt_en_de_big
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Antistale comment...
Hi @stolam, I have also tried iPQ but ran into the same problem: the quantized model cannot run on CPU only. I also tried torch dynamic quantization, but encountered a lot of bugs in torch/nn/functional.py. I am wondering whether you have found a quantization method that works on CPU?