nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
GOALS
• Whisper support
• Exemplifies encoder/decoder (E/D) support
• E/D K/V caching
• E/D parallelism

TESTING
• HuggingFace Whisper model
• Replicate public English Speech Recognition (SR) test using...
Install the release version of nm-magic-wand.
Make ROCm rounding match Torch.
**SUMMARY** Fix a Python 3.8 compatibility issue in one of our benchmarking scripts.

**TEST PLAN** The following command should complete successfully in a py38 environment:

```shell
python \
    -m neuralmagic.benchmarks.run_benchmark_serving \
    ...
```
Upstream sync 2024-05-25 (#249) SUMMARY: Merge commits from https://github.com/vllm-project/vllm/commit/c7f2cf2b7f67bce5842fedfdba508440fe257375 to https://github.com/vllm-project/vllm/commit/f68470e803df575f294e67167b4b83adfe004cfa. Note that https://github.com/vllm-project/vllm/commit/c7f2cf2b7f67bce5842fedfdba508440fe257375 is NOT included in this merge.
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model...
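At its core, a correctness test of this kind checks that the server's greedy token stream matches the HuggingFace reference token-for-token. A minimal helper of the sort such a test might use to report where two sequences diverge (the function name is illustrative, not the actual test code):

```python
def first_divergence(ref_tokens: list, out_tokens: list) -> int:
    """Return the index of the first mismatching token between two
    sequences, or -1 if one is a prefix of the other (or they are equal)."""
    for i, (a, b) in enumerate(zip(ref_tokens, out_tokens)):
        if a != b:
            return i
    return -1

# Example: the sequences agree until position 2.
# first_divergence([1, 2, 3], [1, 2, 4]) -> 2
```

In the real test, `ref_tokens` would come from `model.generate` on the HuggingFace model and `out_tokens` from the vLLM OpenAI-compatible endpoint, both run with greedy decoding so the comparison is deterministic.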
# Summary - Add a `CompressedTensorsW8A8DynamicToken` scheme to support dynamic per-token activation quantization - Update config parsing to support updates made to the `config.json` / quantization config provided with the...
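Dynamic per-token activation quantization computes one int8 scale per token row at runtime, rather than using a static, pre-calibrated activation scale. A minimal NumPy sketch of the idea (illustrative only, not the `CompressedTensorsW8A8DynamicToken` implementation; the epsilon floor is an assumption to avoid division by zero):

```python
import numpy as np

def quantize_per_token(x: np.ndarray):
    """Symmetric int8 quantization with one dynamic scale per token (row).

    x has shape (num_tokens, hidden_dim); returns (int8 values, fp scales).
    """
    # One scale per row, chosen so the row's max magnitude maps to 127.
    scales = np.max(np.abs(x), axis=-1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales
```

Because the scale is recomputed per token at inference time, no activation calibration data is needed, at the cost of a small runtime reduction over each row.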
Description: - Based on a hyper-parameter sweep done on A6000 machines, I found the MxNxK block size of `128x128x64` and a stage count of `5` to be the most performant.
Add a test to make sure that magic_wand is an optional dependency when sparsity is not required.
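The optional-dependency pattern under test usually amounts to a guarded import: the dense path must work with magic_wand absent, and only the sparse path may demand it. A hedged sketch of that pattern (function names are illustrative, not nm-vllm's actual API):

```python
import importlib.util

def sparsity_available() -> bool:
    """Return True only if the optional magic_wand package is installed."""
    return importlib.util.find_spec("magic_wand") is not None

def load_weights(weights, sparse: bool = False):
    """Load weights; require magic_wand only when sparsity is requested."""
    if sparse and not sparsity_available():
        raise ImportError(
            "magic_wand is required for sparsity; install nm-magic-wand"
        )
    # The dense path must succeed with no optional dependency installed.
    return weights
```

A test for this simply imports the package in an environment without magic_wand and asserts that the dense path still works, while the sparse path raises a clear `ImportError`.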