Funtowicz Morgan
When attempting to sparsify a transformers model, `child_name` can for some reason be `None`, so `fx_graph.get(None)` returns `None` and makes the overall process crash. This PR attempts...
This PR aims at adding a new custom backend to TGI, namely Nvidia TensorRT-LLM. The underlying implementation relies on an automatically generated Rust/C++ binding living...
This PR introduces a new subpackage `optimum.tools.records` which aims at providing the bare-minimum infrastructure required to push performance metrics to our internal tracking system - [x] Pythonic API RFC -...
The current backend implementation relies on a locking mechanism to access the executor on the C++ side from within each tokio request context thread. This locking results in heavy contention for all...
This PR attempts to fix a build issue on GCC13, which is now shipped in all nvidia/cuda container images based on ubuntu-24.04. GCC13 now requires some additional headers to be included compared...
This PR bumps some dependencies related to TensorRT-LLM and rebases the Docker container on ubuntu24.04 instead of ubuntu22.04. To support this, we need to use the latest TensorRT-LLM main due to a...
This PR is an initial implementation of llama.cpp as a potential backend for TGI. It mostly targets CPU inference with single- or multi-stream scheduling, potentially spawning multiple instances of the...