Andrei Panferov
# What does this PR do? Fixes the default value of `modules_to_not_convert` in `utils.bitsandbytes.replace_8bit_linear`. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you...
`bfloat16` is not supported on the T4 or on GPUs with the same or lower compute capability, meaning the kernels will fail to compile. This PR isolates the code behind CC...
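A minimal sketch of such a capability gate. The helper name `bf16_supported` is hypothetical; the assumption is that `bfloat16` kernel support requires Ampere-class hardware (compute capability 8.0+), while a T4 reports 7.5:

```python
def bf16_supported(capability):
    """Return True if a CUDA compute capability (major, minor) tuple
    supports bfloat16 kernels.

    Assumption: bfloat16 hardware support arrived with Ampere (CC 8.0);
    a T4 reports (7, 5) and would be gated out.
    """
    major, _minor = capability
    return major >= 8
```

On a live system the tuple could come from `torch.cuda.get_device_capability()`, e.g. `bf16_supported(torch.cuda.get_device_capability())`.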
We could use more predefined configs for better user experience. Models to consider: - [x] bert - [x] gpt2 - [ ] LLaMa - [ ] electra - [ ]...
Some models already rely on device interactions. I propose we don't wrap them and instead throw an error. Possible examples: * Wrapping a model that is already wrapped * Wrapping an 'accelerate'...
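The already-wrapped case could be guarded roughly like this. The `Wrapped` class and error message are illustrative placeholders, not the actual API:

```python
class Wrapped:
    """Hypothetical wrapper that refuses to wrap a model twice."""

    def __init__(self, model):
        # Throw instead of silently double-wrapping, per the proposal above.
        if isinstance(model, Wrapped):
            raise ValueError("model is already wrapped; refusing to wrap again")
        self.model = model
```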
Hi! I'm trying to integrate some of the quantized MatMul C++ kernels into Executorch and I'm having a bad time: the documentation is very vague about what exactly I need to...
# This PR adds support for the Quartet QAT method. The goal of this PR is to integrate inference and training support for the [Quartet QAT method](https://arxiv.org/abs/2505.14669). That would allow...
## Summary When training `pretrain_gpt.py` with sequence packing enabled (`--reset-position-ids` and `--reset-attention-mask`) and using the `--transformer-impl transformer_engine` backend, the custom block-diagonal attention mask generated by `GPTDataset` is effectively ignored. The...
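For reference, the mask that `--reset-attention-mask` is meant to produce can be sketched as follows: each packed sub-sequence attends only within itself, causally. This is an illustrative NumPy reconstruction (function name and `True` = "attention allowed" convention are assumptions, not the actual `GPTDataset` code):

```python
import numpy as np

def block_diag_attention_mask(seq_lengths):
    """Build a causal block-diagonal attention mask for packed sequences.

    seq_lengths: lengths of the sub-sequences packed into one sample.
    Returns a (total, total) boolean matrix where True means position i
    may attend to position j; attention never crosses sequence boundaries.
    """
    total = sum(seq_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in seq_lengths:
        # Lower-triangular block: causal attention within this sub-sequence only.
        mask[start:start + n, start:start + n] = np.tril(np.ones((n, n), dtype=bool))
        start += n
    return mask
```

If the Transformer Engine backend ignores this mask, tokens from one packed document can attend to earlier documents in the same sample, which is exactly the contamination the flags are supposed to prevent.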
## 🐞Describing the bug I'm roughly following [this guide](https://machinelearning.apple.com/research/core-ml-on-device-llama) on LLM exporting. I adjusted the input names to be able to use it with this [HF demo](https://github.com/huggingface/swift-chat). I also added...