What is the right way to get an 8-bit model?
1. Which version or branch of Marian should be compiled: marian-master, marian-dev, or https://github.com/afaji/Marian/tree/fixed-quant ? 2. Can the teacher model be trained with marian-master?
fixed-quant is dead since it has been merged into master, and @afaji needs to pay attention to issue #24 to remove references to it.
You can get a slower 8-bit model, without output matrix quantization, from https://github.com/marian-nmt/marian-dev , or a faster 8-bit model from https://github.com/browsermt/marian-dev .
The teacher can be trained with anything.
You mean I can get a fixed-quant (16-bit or 8-bit) model by adding the flag `--quantize-bits 16` using marian master?
@XapaJIaMnu the 8-bit documentation is lacking.
@yandaowei sorry for the lack of documentation. Could you please check the steps described here: https://github.com/browsermt/students/tree/master/train-student#5-optional-8bit-quantization and see if everything is clear?
The quantisation finetuning is completely optional and is described here: https://github.com/browsermt/students/tree/master/train-student/finetune . Basically, what it does is take an fp32 model and damage it in such a way that it corresponds to a quantised model. It achieves this by limiting the GEMM outputs to only 255 distinct numbers (in the 8-bit case) or 65535 distinct numbers (in the 16-bit case). When trained for a bit with this scheme, the model learns to work with the reduced set of distinct values and performs better when quantised.
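In command form, the finetuning boils down to restarting a normal training run from the trained fp32 student with the quantisation flag added. A minimal sketch, assuming a trained student at `student.npz`; all file names and the build path are placeholders, the data/vocab options mirror an ordinary training invocation, and only `--quantize-bits` (the flag quoted earlier in this thread) is specific to this step. Check the finetune README linked above for the exact recipe.

```bash
# Hedged sketch of quantisation finetuning: a regular Marian training run
# restarted from the trained fp32 student, with fake-quantised values
# (--quantize-bits 8 limits training to 255 distinct values, the "damage"
# described above; use 16 for the 16-bit case).
# All paths and file names here are placeholders.
~/marian-dev/build/marian \
    --model student.npz \
    --train-sets corpus.src corpus.trg \
    --vocabs vocab.spm vocab.spm \
    --quantize-bits 8 \
    --after-epochs 1
```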
So I need two steps to get the 8-bit model. Step 1: follow the doc https://github.com/browsermt/students/tree/master/train-student#5-optional-8bit-quantization to get the 8-bit model. Step 2: finetune the 8-bit model as described in https://github.com/browsermt/students/tree/master/train-student/finetune . Is that correct?
- Train an FP32 model as usual.
- Optional: finetune the FP32 model with 8-bit damage. This step is mostly useful only if the model is particularly small (on the order of tiny11 in our WNGT paper). If it's larger, rounding to 8-bit works out of the box.
- Optional: if using any of the 8-bit quantization methods that pre-compute the scaling factor for activations (these contain `Alpha` in the command line for the next step), run a sample decode through to tune the scaling factor.
- Convert to 8-bit format (a command sketch follows below).
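Concretely, the last step uses `marian-conv` from the browsermt fork. A hedged sketch, assuming `intgemm8` as the plain (non-Alpha) 8-bit target; the file names and build path are placeholders, and the students README linked above is the authoritative recipe:

```bash
# Hedged sketch of the final conversion step. intgemm8 is the plain 8-bit
# format without precomputed activation scaling factors; file names and
# the build path are placeholders, not from this thread.
~/browsermt-marian-dev/build/marian-conv \
    -f student.finetuned.npz \
    -t student.intgemm8.bin \
    --gemm-type intgemm8
```

The `Alpha` variants of `--gemm-type` additionally consume the activation statistics collected by the sample decode in the previous step; the exact invocation for that is in the linked README.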