David Corvoysier
When using MarlinInt4WeightQBitsTensor and its associated optimized gemm kernel, the weight/scales/zero-point readback becomes incorrect as soon as parallelization increases. The consequence is that output features higher than 128...
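The report above is truncated, but a minimal repro in this spirit might sweep the output feature size and compare a qint4-quantized linear against its float16 reference. This is an illustrative sketch, not the exact test from the issue, and it assumes quanto dispatches to the optimized Marlin gemm for fp16/int4 weights on a CUDA device:

```python
import torch
from optimum.quanto import freeze, qint4, quantize

torch.manual_seed(0)
for out_features in (64, 128, 256, 512):
    # Wrap the Linear so quantize() can replace it in-place
    model = torch.nn.Sequential(
        torch.nn.Linear(256, out_features)
    ).to(dtype=torch.float16, device="cuda")
    inputs = torch.randn(8, 256, dtype=torch.float16, device="cuda")
    expected = model(inputs)
    # Quantize weights to int4 and freeze to materialize packed tensors
    quantize(model, weights=qint4)
    freeze(model)
    error = (model(inputs) - expected).abs().max().item()
    print(f"out_features={out_features}: max abs error {error:.4f}")
```

If the readback bug described above is present, the error would be expected to jump once `out_features` crosses the reported threshold.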
When performing quantized matrix multiplication with `int8` weights on an AMD CPU, the results differ from those obtained when running the same operation on CUDA or on an Intel...
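A hedged sketch of such a cross-device comparison, using quanto's public `quantize`/`freeze` API and assuming a host with a CUDA device; running the same script on an AMD and an Intel machine would expose the reported CPU divergence:

```python
import copy

import torch
from optimum.quanto import freeze, qint8, quantize

torch.manual_seed(0)
reference = torch.nn.Sequential(torch.nn.Linear(512, 512))
inputs = torch.randn(4, 512)

outputs = {}
for device in ("cpu", "cuda"):
    # Quantize an identical copy of the model on each device
    model = copy.deepcopy(reference).to(device)
    quantize(model, weights=qint8)
    freeze(model)
    outputs[device] = model(inputs.to(device)).cpu()

# With identical quantized weights, the two paths should agree closely
print((outputs["cpu"] - outputs["cuda"]).abs().max().item())
```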
@kechan reported compilation failures when using quanto in Google Colab, both on CPU and GPU.
Since IST-DASLab introduced the mixed-precision fp16-int4 [MARLIN](https://github.com/IST-DASLab/marlin) (Mixed Auto-Regressive Linear) kernels, MARLIN variants have appeared for other data types. In particular, mixed-precision fp16/bf16-int4/int8 kernels have...
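To make the arithmetic behind such mixed-precision kernels concrete, here is a plain-PyTorch reference of what an fp16-int4 gemm computes: int4 weights are dequantized with per-group scales and zero-points, then multiplied against fp16 activations. Real MARLIN kernels fuse these steps on the GPU; this sketch (all names and shapes are illustrative) only shows the math, not the implementation:

```python
import torch

def w4a16_reference(x, q, scales, zeros, group_size=128):
    # x: (M, K) fp16 activations
    # q: (K, N) uint8 tensor holding int4 values in [0, 15]
    # scales/zeros: (K // group_size, N) per-group quantization parameters
    w = (q.to(torch.float16) - zeros.repeat_interleave(group_size, dim=0)) \
        * scales.repeat_interleave(group_size, dim=0)
    # Accumulate in fp32, as the fused GPU kernels do
    return (x.float() @ w.float()).to(x.dtype)

# Tiny smoke test with random data
M, K, N, G = 4, 256, 128, 128
x = torch.randn(M, K, dtype=torch.float16)
q = torch.randint(0, 16, (K, N), dtype=torch.uint8)
scales = torch.rand(K // G, N, dtype=torch.float16) * 0.1
zeros = torch.full((K // G, N), 8.0, dtype=torch.float16)
print(w4a16_reference(x, q, scales, zeros, G).shape)  # torch.Size([4, 128])
```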
[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers 4.43.2
Issue #4307

### Description

This PR creates Hugging Face's PyTorch DLC for training on neuron-v2 devices (Trainium). By submitting this pull request, I confirm that my contribution is made under the...
## Overview of DLCs to update

_Inference - Neuronx_

Dependencies versions:
- transformers: 4.43.2
- torch: 2.1.2
- aws-neuron-sdk: 2.20.0
- optimum-neuron: 0.0.25

_Training - Neuronx_

Dependencies versions:
- transformers: 4.43.2
- torch: 2.1.2
- aws-neuron-sdk: 2.20.0...
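As a sanity check when building these images, a small script can verify the pinned versions inside the container. The package names below are assumed to match their PyPI distributions; the Neuron SDK release (2.20.0) spans several packages (e.g. torch-neuronx, neuronx-cc) with their own version schemes and is left out of this sketch:

```python
from importlib.metadata import version

# Pinned versions taken from the list above
pinned = {
    "transformers": "4.43.2",
    "torch": "2.1.2",
    "optimum-neuron": "0.0.25",
}
for name, expected in pinned.items():
    installed = version(name)
    status = "OK" if installed == expected else f"MISMATCH (expected {expected})"
    print(f"{name}=={installed}: {status}")
```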