akshay-anyscale
Ready to merge, pending @aslonnie's approval.
Hi @lamhoangtung, can you try using the `serve run` command instead? You can refer to the README here for example usage: https://github.com/ray-project/ray-llm
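As a rough sketch, the command looks like this (the config path is one of the bundled serve_configs and may differ for your model):

```shell
# Serve a model from one of the bundled config files (path is illustrative)
serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml
```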
Can you share the model YAMLs that you are using? You'll need to set `num_gpus_per_worker` to 0.5 for both.
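For reference, a minimal sketch of the relevant section of each model YAML, assuming the standard RayLLM `scaling_config` layout (all other fields omitted):

```yaml
# Only the scaling section is shown; values here are illustrative.
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 0.5  # fractional GPU so the two models can share one GPU
```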
Are you looking to fine-tune LLMs? RayLLM is currently meant only for inference, but we do have examples of how to do fine-tuning with Ray: https://docs.ray.io/en/latest/train/examples/lightning/vicuna_13b_lightning_deepspeed_finetune.html#fine-tune-vicuna-13b-with-lightning-and-deepspeed
Can you provide the code you are using for querying?
What models are you using?
Yes, you should be able to set up observability using the general Ray Serve guides. For the custom metrics, you can use the "ray_aviary" prefix to search for them (e.g. if you're using...
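For example, an illustrative PromQL selector that lists every metric under that prefix:

```promql
# Match all RayLLM/Aviary metrics by name prefix
{__name__=~"ray_aviary_.*"}
```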
Hi @roelschr, I believe this is because we do some batching (up to 100ms) to make streaming more efficient. If you make the denominator "ray_aviary_tokens_generated" instead, this should be closer to...
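As a sketch, assuming the metric is a Prometheus counter, the denominator would be a rate over it; only `ray_aviary_tokens_generated` is taken from the reply above, and the window is illustrative:

```promql
# Tokens generated per second, for use as the denominator of a per-token ratio
sum(rate(ray_aviary_tokens_generated[1m]))
```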
Try using `serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml`. I'll fix the docs to reflect that.
Docs fixed here: https://github.com/ray-project/ray-llm/pull/85