Ashwin Bharambe
We should use the Inference APIs to execute Llama Guard instead of needing to use HuggingFace APIs directly. The actual inference is then handled by the Inference implementation.
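A minimal sketch of what that delegation could look like, assuming an Inference provider that exposes a `chat_completion()` method; the class name, model identifier, and response shape below are illustrative, not the actual implementation:

```python
class LlamaGuardShield:
    """Sketch of a safety shield that delegates generation to an Inference provider."""

    def __init__(self, inference_api, model: str = "Llama-Guard-3-8B"):
        self.inference_api = inference_api  # any provider implementing chat_completion()
        self.model = model

    async def run(self, user_message: str) -> bool:
        # Ask the guard model to classify the message; the exact prompt template
        # Llama Guard expects is omitted here for brevity.
        response = await self.inference_api.chat_completion(
            model=self.model,
            messages=[{"role": "user", "content": user_message}],
        )
        # Llama Guard replies with "safe" or "unsafe" plus violated categories.
        return response.completion_message.content.strip().lower().startswith("safe")
```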
In the previous design, the server endpoint at the top-most level extracted the headers from the request and set provider data (e.g., private keys) that the implementations could retrieve using...
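For context, the pattern being described — headers parsed once at the server boundary and then read by providers deeper in the call stack — is commonly implemented with a context variable. A sketch under that assumption (the header name and helper names here are illustrative, not the repository's actual ones):

```python
import json
from contextvars import ContextVar

_provider_data: ContextVar[dict] = ContextVar("provider_data", default={})

def set_request_provider_data(headers: dict) -> None:
    # Called once per request by the top-level server endpoint.
    payload = headers.get("X-LlamaStack-ProviderData")  # illustrative header name
    _provider_data.set(json.loads(payload) if payload else {})

def get_request_provider_data() -> dict:
    # Called from inside a provider implementation to read e.g. private API keys.
    return _provider_data.get()
```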
Most of the current inference providers only implement the `chat_completion()` method. The `completion()` method raises a `NotImplementedError`. We should implement this method for all the inference providers:
- meta-reference
- ...
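As a rough sketch of the shape this could take — with simplified stand-in types and a hypothetical `_generate()` helper in place of the real model call:

```python
from dataclasses import dataclass

@dataclass
class CompletionResponse:
    # Simplified stand-in for the real API response type.
    content: str
    stop_reason: str

class InferenceProvider:
    async def chat_completion(self, model: str, messages: list[dict]) -> CompletionResponse:
        raise NotImplementedError

    async def completion(self, model: str, content: str) -> CompletionResponse:
        # Current behavior in most providers: not implemented.
        raise NotImplementedError(f"completion() not supported by {type(self).__name__}")

class MetaReferenceInference(InferenceProvider):
    async def completion(self, model: str, content: str) -> CompletionResponse:
        # One possible implementation: run raw text completion directly against
        # the underlying generator instead of going through the chat template.
        text = await self._generate(content)
        return CompletionResponse(content=text, stop_reason="end_of_turn")

    async def _generate(self, prompt: str) -> str:
        # Model-specific generation elided in this sketch.
        return ""
```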
This PR makes several core changes to the developer experience surrounding Llama Stack.

**Background:** PR https://github.com/meta-llama/llama-stack/pull/92 introduced the notion of "routing" to the Llama Stack. It introduced three object types:...
Added support for structured output in the API and added a reference implementation for meta-reference. A few notes:
- Two formats are specified in the API: JSON schema and EBNF...
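A sketch of how the JSON-schema flavor could be requested from a client; the client class exists in `llama-stack-client`, but the exact parameter names (`model`, `response_format`) and the response-format shape below are assumptions to be checked against the API:

```python
# Sketch only: requesting JSON-schema-constrained output through the client.
import json
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year_born": {"type": "string"}},
    "required": ["name", "year_born"],
}

response = client.inference.chat_completion(
    model="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Who wrote the book Charlotte's Web? Answer as JSON."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(json.loads(response.completion_message.content))
```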
# What does this PR do?

Significantly simplifies running tests. Previously you ran tests by doing:

```bash
MODEL_ID=<model-id> PROVIDER_ID=<provider-id> PROVIDER_CONFIG=config.yaml pytest -s llama_stack/providers/tests/inference/test_inference.py
```

This was pretty annoying because:
- ...
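One common way to move configuration like this out of environment variables is to surface it through pytest options and fixtures instead. A sketch of that general technique — not necessarily the exact mechanism this PR adopts; option and fixture names are illustrative:

```python
# conftest.py -- sketch of replacing ad-hoc environment variables with pytest options.
import pytest

def pytest_addoption(parser):
    parser.addoption("--inference-model", default="Llama3.1-8B-Instruct",
                     help="Model to run inference tests against")
    parser.addoption("--provider-config", default=None,
                     help="Optional path to a provider config YAML")

@pytest.fixture
def inference_model(request):
    return request.config.getoption("--inference-model")

@pytest.fixture
def provider_config(request):
    return request.config.getoption("--provider-config")
```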
### System Info

...

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### 🐛 Describe the bug

vLLM does not work when...
### 🚀 The feature, motivation and pitch

Inference providers (fireworks, together, meta-reference) support guided decoding (specifying a JSON schema, for example, as a "grammar" for decoding) with inference. vLLM supports this functionality --...
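For reference, a sketch of how guided decoding can be requested from a vLLM OpenAI-compatible server today; the `guided_json` extra parameter is vLLM-specific and its exact name should be checked against the vLLM version in use:

```python
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"guided_json": schema},  # vLLM-specific guided decoding hook
)
print(response.choices[0].message.content)
```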
### 🚀 The feature, motivation and pitch

We have a decently flexible testing system for testing various combinations of providers when composing a Llama Stack. See https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/tests/README.md. We need to...
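As a rough illustration of how provider combinations can be exercised with pytest parametrization — provider names and the test body here are illustrative, not the repository's actual fixtures:

```python
import itertools
import pytest

INFERENCE_PROVIDERS = ["meta-reference", "fireworks", "together"]
SAFETY_PROVIDERS = ["llama-guard", "prompt-guard"]

@pytest.mark.parametrize(
    "inference_provider,safety_provider",
    list(itertools.product(INFERENCE_PROVIDERS, SAFETY_PROVIDERS)),
)
def test_stack_composition(inference_provider, safety_provider):
    # A real test would compose a stack from this provider combination and run
    # an end-to-end request; here we only assert the pairing is well-formed.
    assert inference_provider and safety_provider
```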