nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
# Summary - Initial implementation of Compressed Config support + Activation Quantization for static per-tensor W8A8 - Includes fused kernels added by @varun-sundar-rabindranath # Testing/Sample Script: ```python from vllm...
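To make "static per-tensor w8a8" concrete, here is a minimal pure-Python sketch of the idea: weights and activations are each mapped to int8 with a single scale per tensor, the scale is fixed ahead of time from calibration data, and the integer accumulator is rescaled once at the end. This is not the PR's fused kernel or its config format; all function names below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: hypothetical helper names, not the nm-vllm implementation.

def static_scale(calibration_values):
    """Derive one scale for the whole tensor from calibration data (symmetric int8)."""
    return max(abs(v) for v in calibration_values) / 127.0


def quantize_per_tensor(values, scale):
    """Map floats to int8 using a single per-tensor scale, clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate float values from int8 values."""
    return [q * scale for q in q_values]


def w8a8_dot(weights_q, acts_q, w_scale, a_scale):
    """Dot product on int8 operands: accumulate in integers, then rescale once."""
    acc = sum(w * a for w, a in zip(weights_q, acts_q))
    return acc * w_scale * a_scale
```

Because the activation scale is static, it is computed once offline rather than per batch at inference time; values outside the calibrated range are clamped.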
SUMMARY: * update NIGHTLY workflow to be whl-centric * update benchmarking jobs to use the generated whl TEST PLAN: runs on remote push. I'm also triggering NIGHTLY manually.
This PR adds support for loading weights from the compressed `safetensor` representation, which was introduced by Neural Magic and implemented by @Satrat...
Introducing an end-to-end test case that verifies basic correctness of the vllm engine by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model...
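The comparison step described above can be sketched as a small helper that reports, per prompt, the first position where the two token sequences diverge. This is an assumed shape for the check, not the test from the PR: the function names and the `prompt -> list of tokens` dictionaries are hypothetical, and collecting the tokens from the server and from the HuggingFace model is left out.

```python
# Illustrative sketch only: hypothetical helpers, not the PR's test code.

def first_mismatch(expected_tokens, actual_tokens):
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (expected, actual) in enumerate(zip(expected_tokens, actual_tokens)):
        if expected != actual:
            return i
    if len(expected_tokens) != len(actual_tokens):
        # One sequence is a strict prefix of the other.
        return min(len(expected_tokens), len(actual_tokens))
    return None


def compare_outputs(hf_outputs, server_outputs):
    """Compare per-prompt token lists; return {prompt: first mismatch index} for failures."""
    failures = {}
    for prompt, hf_tokens in hf_outputs.items():
        idx = first_mismatch(hf_tokens, server_outputs.get(prompt, []))
        if idx is not None:
            failures[prompt] = idx
    return failures
```

An empty result means every prompt produced identical tokens on both sides; a test would typically assert on that.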
Pulls in parts of https://github.com/vllm-project/vllm/pull/3014. For now, only the forward method on LlamaMLP is tagged with the new backend. Sample code to run (derived from examples/offline_inference.py): ``` from vllm import...
This PR uses the server framework from #200 and translates the action added by @mgoin in #166 into end-to-end tests to validate correctness via lm-eval-harness. I’ve set the PR to...
DO NOT MERGE. Quantization WIP, based off vLLM PR 1508. Quantized model used for dev/testing: https://huggingface.co/nm-testing/Nous-Hermes-Llama2-13b-smoothquant Base model: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b Testing: Command: `python3 ./examples/offline_quantized_inference.py` Expected output with `tensor_parallel_size=1`...
Single Node Docker Compose Deployment - Creates a `deploy` directory to house examples using reference architectures (we can house K8s examples, New Relic examples, etc. here). I was thinking we...
Adds initial test framework for tests making requests against the server. Wraps a re-implementation of an existing `ServerRunner` in a context manager with some additional structured logging, and includes a...
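As a rough illustration of wrapping a server process in a context manager, the sketch below starts a subprocess, yields it to the test body, and guarantees shutdown on exit while emitting log records with structured fields. It is an assumption about the general pattern, not a copy of the PR's code: `managed_server`, its parameters, and the logger name are hypothetical, and the existing `ServerRunner` is not reproduced here.

```python
# Illustrative sketch only: hypothetical names, not the PR's test framework.
import contextlib
import logging
import subprocess

logger = logging.getLogger("server_runner_sketch")


@contextlib.contextmanager
def managed_server(args, shutdown_timeout=10.0):
    """Start a server subprocess and make sure it is stopped when the block exits."""
    # Structured fields are passed via `extra` so log handlers can pick them up.
    logger.info("starting server", extra={"server_args": args})
    proc = subprocess.Popen(args)
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            # Escalate if the process ignores the termination request.
            proc.kill()
            proc.wait()
        logger.info("server stopped", extra={"returncode": proc.returncode})
```

A test would issue its requests inside the `with managed_server([...]) as proc:` block; the `finally` clause runs even when an assertion fails, so a failed test does not leave a server running.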