Andrew Lapp

222 comments by Andrew Lapp

Good issue, I've run into all of these problems. I disagree about `llama.cpp` though: there's no reason to include `llama-cpp-python` by default in downstream dependents such as vLLM. Additionally, `llama-cpp-python`...
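As an illustration of the alternative, a heavy backend like `llama-cpp-python` can be declared as an opt-in extra rather than a hard dependency. This is a minimal setuptools sketch with a hypothetical package name, not outlines' actual packaging:

```python
# setup.py -- illustrative only; the package name is hypothetical and
# outlines' real packaging may differ.
from setuptools import setup, find_packages

setup(
    name="example-structured-gen",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "transformers",  # core dependencies only
    ],
    extras_require={
        # heavy/optional backends are opt-in:
        #   pip install example-structured-gen[llamacpp]
        "llamacpp": ["llama-cpp-python"],
        "vllm": ["vllm"],
    },
)
```

With this layout, only users who want the llama.cpp backend pull in `llama-cpp-python`, via `pip install example-structured-gen[llamacpp]`.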

@ahmed-moubtahij Yes, outlines only becomes the bottleneck after ~1,000 tokens/s, and `vllm` is substantially faster than `transformers`. However, are you sure you set `device_map="cuda"`? It sounds like you might have been...
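For reference, this is roughly what GPU placement looks like when loading the model through `transformers` directly; a minimal sketch where the model name and prompt are placeholders:

```python
# Minimal sketch: loading a causal LM on GPU with transformers.
# Model name and prompt are placeholders; adjust for your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example model, swap for yours
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",          # without this, the model stays on CPU and generation is very slow
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```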

Good idea, this is the #1 metric people care about.

- vLLM benchmark script: https://github.com/vllm-project/vllm/tree/main/benchmarks
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md#benchmarking-per-model
- llama.cpp: https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md
- TGI: https://github.com/huggingface/text-generation-inference/blob/main/benchmark/README.md
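For a rough sense of what these scripts report, the headline number is usually generated tokens per second. A minimal timing sketch, where `generate_fn` is a stand-in for whichever engine's generate call is being benchmarked:

```python
# Rough throughput sketch: tokens/second over a batch of prompts.
# `generate_fn` is a placeholder for the engine under test and is
# assumed to return the generated token ids for one prompt.
import time

def measure_throughput(generate_fn, prompts, max_new_tokens=128):
    start = time.perf_counter()
    total_new_tokens = 0
    for prompt in prompts:
        output_ids = generate_fn(prompt, max_new_tokens=max_new_tokens)
        total_new_tokens += len(output_ids)
    elapsed = time.perf_counter() - start
    return total_new_tokens / elapsed  # generated tokens per second
```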

Can you please try `pip install git+https://github.com/lapp0/outlines@add-fsm-union-pin-core` and report back whether it works? This is the branch of an in-progress PR which fixes the Rust installation issue.

Thanks so much for helping me test it! I expect a new release soon which includes the branch mentioned above.

I assume this was with ExLlamaV2, or am I wrong? Good find.

I ran your reproduction script; thanks for informing us about this issue. Here are some samples of tokens (in byte format) which cause the error in your model:

- `\xef\xbf\xbd\xe2\x80\x9e`...
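For anyone debugging a similar report, a quick way to spot vocabulary tokens whose raw bytes don't form valid UTF-8 on their own is to decode each token individually and look for the replacement character. A rough sketch (placeholder model; this is not the reproduction script referenced above):

```python
# Rough sketch: flag vocabulary tokens whose bytes don't decode to
# valid UTF-8 on their own. The model name is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

problem_tokens = []
for token_id in range(len(tokenizer)):
    decoded = tokenizer.decode([token_id])
    # U+FFFD (the replacement character) appears when a token's bytes
    # are only a fragment of a multi-byte UTF-8 sequence.
    if "\ufffd" in decoded:
        problem_tokens.append(token_id)

print(f"{len(problem_tokens)} tokens decode with replacement characters")
```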

We could update `outlines/fsm/json_schema.py` to allow arbitrary order; however, this would increase the complexity (and compilation time) of the FSM exponentially. Once we have CFG working, this will be viable....
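To see where the blow-up comes from: with n properties, a pattern that accepts every key order has to cover all n! orderings. A toy illustration of the counting, not outlines' actual FSM construction:

```python
# Toy illustration: accepting arbitrary property order means covering
# all n! orderings. This is NOT outlines' actual FSM construction.
import math
from itertools import permutations

properties = ["name", "age", "email", "address", "phone"]

orderings = [
    ",".join(f'"{key}":<value>' for key in ordering)
    for ordering in permutations(properties)
]
print(len(orderings), "alternatives for", len(properties), "properties")  # 120 = 5!
print("10 properties would need", math.factorial(10), "alternatives")     # 3,628,800
```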

I'm not familiar with any of these other than FlexFlow, unfortunately. Happy to include PRs for any of them if they are uniquely valuable inference engines.

While this doesn't fully implement your suggestion, you gave me the idea to make a sample-efficient track. https://github.com/lapp0/sample-efficient-nanogpt