SeanHH86 comments

Results 8 comments of SeanHH86

Requested tokens (817) exceed context window of 512

```
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 1.0
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 1.0
    downscale_delay_s: 300.0
    upscale_delay_s: 90.0
  ray_actor_options:
    num_cpus: 4 # for a model deployment, we...
```

Support Quantized Model

Inference's speed is slow

Enhance inference API to support OpenAI style

OpenAI API: https://platform.openai.com/docs/api-reference/introduction

- Organizations and projects (optional)
```
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Organization: YOUR_ORG_ID" \
  -H "OpenAI-Project: $PROJECT_ID"
```
- List models: GET https://api.openai.com/v1/models
```
curl...
```
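For reference, the list-models call quoted in that comment could be issued from Python roughly as follows. This is a minimal sketch using only the standard library; the names `build_headers` and `list_models` are hypothetical helpers introduced here for illustration, not part of this project or of any OpenAI client, and the organization and project headers are treated as optional, as the comment describes.

```python
import json
import urllib.request

# URL taken from the comment above.
OPENAI_MODELS_URL = "https://api.openai.com/v1/models"


def build_headers(api_key, org_id=None, project_id=None):
    """Build the HTTP headers used by the curl example (hypothetical helper)."""
    headers = {"Authorization": f"Bearer {api_key}"}
    # The organization and project headers are optional, so add them only when given.
    if org_id:
        headers["OpenAI-Organization"] = org_id
    if project_id:
        headers["OpenAI-Project"] = project_id
    return headers


def list_models(api_key, org_id=None, project_id=None, url=OPENAI_MODELS_URL):
    """Send GET /v1/models and return the decoded JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        headers=build_headers(api_key, org_id, project_id),
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Passing a different `url` would let the same helper be pointed at a local server that exposes an OpenAI-style `/v1/models` route, assuming such a route exists.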

SeanHH86

Requested tokens (817) exceed context window of 512

Support Quantized Model

Enhance inference API to support OpenAI style

Enhance inference API to support OpenAI style

Enhance inference API to support OpenAI style

Install dependency llama-cpp-python failed

Install dependency llama-cpp-python failed

Install dependency llama-cpp-python failed