[Serve] Make batching work with multiplexing
fixes https://github.com/ray-project/ray/issues/56633
- [x] Add documentation
- [x] update `get_multiplexed_model_id` to see if we are in a batch context first
- [x] update logic
- [x] add tests
- [x] does not introduce any backwards incompatibility. Previously, the system did not provide any guarantee about the contents of a batch; now we add a constraint that guarantees each batch contains requests for the same model.
The thing I dislike about this implementation is that it does not fill the batch when the replica is responsible for > 2 models and incoming traffic is equally distributed between those models, because the current implementation fills the batch first and then divides it by model.
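
To make the limitation concrete, here is a minimal sketch of the "fill first, then divide" behavior. It is not the Ray Serve code from this PR: the `Request` tuple, the `fill_then_split` helper, and the in-memory `queue` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical request representation: (model_id, payload).
Request = Tuple[str, str]


def fill_then_split(
    queue: List[Request], max_batch_size: int
) -> Dict[str, List[Request]]:
    """Take up to max_batch_size requests, then group them by model ID."""
    # Step 1: fill the batch without looking at model IDs.
    batch = queue[:max_batch_size]
    # Step 2: divide the filled batch into one sub-batch per model.
    sub_batches: Dict[str, List[Request]] = defaultdict(list)
    for model_id, payload in batch:
        sub_batches[model_id].append((model_id, payload))
    return dict(sub_batches)


# Eight queued requests spread evenly across four models (two per model).
queue = [(f"model_{i % 4}", f"req_{i}") for i in range(8)]
sub_batches = fill_then_split(queue, max_batch_size=4)
sizes = {model_id: len(reqs) for model_id, reqs in sub_batches.items()}
print(sizes)
```

In this sketch every sub-batch ends up with a single request, even though two requests per model are waiting in the queue and the batch size is 4. Grouping by model before filling would avoid that, at the cost of a more involved queueing scheme.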