sparseml
Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Replace quant_min/max values with INT8/UINT8 ranges so ONNX export is supported for other bit widths (e.g., 4).
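A minimal sketch of the idea, with hypothetical helper names: ONNX's 8-bit quantization ops expect INT8/UINT8 bounds, so a narrower fake-quant range (e.g., 4-bit `[0, 15]`) is declared with the full 8-bit range at export time, while the stored integer values, scale, and zero point stay unchanged.

```python
# Hypothetical sketch: widen a narrow quant range to the INT8/UINT8
# bounds ONNX expects. Dequantized values are unaffected because the
# integers, scale, and zero point are not modified.

def widen_to_8bit(quant_min, quant_max):
    """Map a narrow quantization range onto INT8/UINT8 bounds."""
    if quant_min < 0:
        return -128, 127  # signed -> INT8
    return 0, 255         # unsigned -> UINT8

def dequant(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = 0.1, 8
q4 = 13                                  # a 4-bit value in [0, 15]
before = dequant(q4, scale, zp)
new_min, new_max = widen_to_8bit(0, 15)  # declared range becomes (0, 255)
after = dequant(q4, scale, zp)           # value/scale/zero-point unchanged
assert before == after
```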
This PR adds a transformation that quantizes weights for weights-only quantization. It was tested on a Llama2 model.
While applying modifiers, we use the module's device to place additional modules and buffers, such as fake-quantization modules, on the correct device. Presently, we default to cpu in case the module...
This PR adds support for per-token dynamic quantization. Quantization scales and zero points are computed "on-the-fly" for each new tensor. Each token has its own quantization scale and zero-point (one...
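A plain-Python sketch (hypothetical helper name, not the PR's API) of what per-token dynamic quantization computes: each token derives its own asymmetric UINT8 scale and zero point on the fly from that token's min/max, rather than reusing static calibration statistics.

```python
# Per-token dynamic quantization sketch: one (scale, zero_point)
# pair per token, computed from that token's own range.

def per_token_dynamic_quantize(tokens, bits=8):
    qmin, qmax = 0, 2 ** bits - 1
    quantized = []
    for token in tokens:              # one row of activations per token
        lo = min(min(token), 0.0)     # include 0 so it is exactly representable
        hi = max(max(token), 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1e-8
        zero_point = round(qmin - lo / scale)
        q = [min(qmax, max(qmin, round(v / scale + zero_point))) for v in token]
        quantized.append((q, scale, zero_point))
    return quantized

acts = [[-1.0, 0.5, 2.0], [0.1, 0.2, 0.3]]
out = per_token_dynamic_quantize(acts)
# each entry carries its own (values, scale, zero_point)
```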
The file_path was being joined twice to the file names, leading to wrong paths (lines 663 and 683). This PR removes the duplicate path join.
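A hypothetical reproduction of the bug pattern (names are illustrative, not the actual sparseml code): file names were joined to `file_path` once when collected and again when used, duplicating the directory prefix.

```python
import os

file_path = "model_dir"
names = ["a.onnx", "b.onnx"]

files = [os.path.join(file_path, n) for n in names]  # joined once

# Buggy: joining file_path a second time duplicates the prefix,
# e.g. "model_dir/model_dir/a.onnx"
buggy = [os.path.join(file_path, f) for f in files]

# Fixed: use the already-joined paths directly,
# e.g. "model_dir/a.onnx"
fixed = files
```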
Example recipe for what this enables:

```yaml
OutputDistillationModifier:
  targets: ['layer.1', 'layer.2']
  transforms: []
  comparison: "square_head"
  orig_scale: 1.0
  distill_scale: 1.0
```
Define separate AddInput classes for the different branches of the bottleneck. This allows controlling quantization via module class in addition to module name.
This PR enables exporting encoder-decoder models such as FLAN-T5 through the sparseml export pathway. Tested on FLAN-T5 models.