
[MOE Quantization] Update transformers version to 4.40.0

dbogunowicz opened this issue 1 year ago • 1 comment

We need to update the transformers version to support the QWEN2-MOE model; see: https://github.com/huggingface/transformers/releases/tag/v4.40.0 (this also fits our goal of consistently matching the latest release).
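For reference, the bump itself amounts to raising the dependency pin, roughly like the sketch below. The variable name and version bounds are placeholders, not the actual sparseml constraint:

```python
# Illustrative only: placeholder name and bounds; the real pin lives in
# sparseml's setup.py and may differ.
_transformers_deps = ["transformers>=4.40,<4.41"]
```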

Important changes

By default, untie the vocab embedding weights

transformers 4.40.x now has difficulty saving models after one-shot. This applies to models that are 1) quantized using the new (non-vLLM) QuantizationModifiers and 2) in the "fakequant" state. The HF developers changed the internal logic for resolving "tied weights" (such as the embedding and lm_head modules) in save_pretrained, and those decisions are buried quite deep in the transformers codebase. My solution is to untie the weights for one-shot models on initialization.

Benefits: it requires minimal changes on our side and does not affect the size of saved models on disk (safetensors unties the weights as well).

Downsides: the one-shot process might be slightly less performant, since the memory required to hold the embedding layer on CUDA doubles.
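A minimal sketch of what the untying could look like. The helper name, its placement, and the model ID are illustrative; the actual change in this PR may hook into model initialization differently:

```python
import torch
from transformers import AutoModelForCausalLM

def untie_word_embeddings(model):
    """Give lm_head its own parameter so save_pretrained no longer treats it
    as tied to the input embedding. Hypothetical helper, for illustration."""
    input_emb = model.get_input_embeddings()
    output_emb = model.get_output_embeddings()
    if output_emb is not None and output_emb.weight is input_emb.weight:
        # Clone so the two modules stop sharing storage; this is the
        # memory-doubling downside mentioned above.
        output_emb.weight = torch.nn.Parameter(input_emb.weight.detach().clone())
    # Prevent transformers from re-tying the weights on save/load.
    model.config.tie_word_embeddings = False
    return model

# Example: untie right after loading, before running one-shot.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")
model = untie_word_embeddings(model)
```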

Adjust the related tests (as a result of untied weights)

The overall sparsity reported in the tests will now be lower, since the untied copy of the embedding matrix is dense and adds to the total parameter count. The calculations below justify the change:

num_parameters_embedding_layer = 9216000

------ NEW CALCULATIONS (untied embeddings) ------
num_parameters_total_new = 24407712
number_zero_parameters_new = 2986128
global_sparsity_new = number_zero_parameters_new / num_parameters_total_new ≈ 0.1223

------ OLD CALCULATIONS (tied embeddings) ------
number_zero_parameters_old = number_zero_parameters_new = 2986128 (we are not pruning word embeddings, so the number of zero parameters stays constant)
num_parameters_total_old = num_parameters_total_new - num_parameters_embedding_layer = 15191712 (as if we had "one less" word embedding matrix)
global_sparsity_old = number_zero_parameters_old / num_parameters_total_old ≈ 0.1966
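For anyone who wants to sanity-check these figures, here is a quick one-off script; the constants are copied from above, nothing else is assumed:

```python
# Sanity check of the sparsity figures quoted above.
num_parameters_embedding_layer = 9_216_000
num_parameters_total_new = 24_407_712
number_zero_parameters = 2_986_128  # embeddings are not pruned, so constant

# New (untied) accounting: the extra embedding copy is part of the total.
global_sparsity_new = number_zero_parameters / num_parameters_total_new

# Old (tied) accounting: one embedding matrix fewer in the total.
num_parameters_total_old = num_parameters_total_new - num_parameters_embedding_layer
global_sparsity_old = number_zero_parameters / num_parameters_total_old

print(f"new: {global_sparsity_new:.4f}")  # new: 0.1223
print(f"old: {global_sparsity_old:.4f}")  # old: 0.1966
```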

dbogunowicz · May 06 '24 11:05

@mgoin the failure in export-tests looks transient, just fyi

dbogunowicz · May 07 '24 10:05