Nicolas Patry

Results 977 comments of Nicolas Patry

Thanks for surfacing this. Discussing internally, we figured there are some security implications to this behavior, which we're most likely going to close, so this behavior will go away (and...

Do you mind sharing why consuming 2x memory is an issue for you? Adding context is likely to help others as well. In general for GPUs, the CPU RAM...

Hey, I had indeed been away from this crate for quite a long time, because I just didn't need it anymore.

> Simple library to listen and send events globally to...

Sorry, the issue obviously occurs in a vLLM from-source build, which is hard to debug for particular individual setups. We're ditching vLLM as a dependency anyway, so it should be...

Also, we're relying more and more on `nix` to speed up our build times and give us fewer headaches around builds. ``` # Install cuda (system-wide, no conda otherwise...

| Metric Name | Type | Unit | Implemented by TGI Already |
| ----------- | ---- | ---- | -------------------------- |
| model_load_time | Counter | Seconds | | ...
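As a sketch of how a metric like `model_load_time` could be recorded, here is a minimal plain-stdlib stand-in (the registry and function names are hypothetical; a real server such as TGI would export this through a proper metrics library like a Prometheus client):

```python
import time

# Hypothetical flat metrics registry; a real deployment would use
# typed Counter/Histogram objects from a metrics client library.
metrics: dict[str, float] = {}

def record_model_load(load_fn):
    """Time a model-load callable and record the duration in seconds."""
    start = time.perf_counter()
    result = load_fn()
    metrics["model_load_time_seconds"] = time.perf_counter() - start
    return result

# Stand-in loader; real code would load model weights here.
model = record_model_load(lambda: "weights")
print(metrics["model_load_time_seconds"] >= 0.0)
```

The unit (seconds) is carried in the metric name, which mirrors the common Prometheus naming convention.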

By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy. https://github.com/huggingface/text-generation-inference/pull/1940 Basically we might not...
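The occupancy-maximizing idea can be sketched as greedy admission under a token budget. This is purely illustrative (the request shape, the budget model, and the function name are assumptions, not TGI's actual scheduler):

```python
from collections import deque

def schedule(queue: deque, running_tokens: int, budget: int) -> list:
    """Greedily admit queued requests while their estimated token
    footprint still fits under the total token budget.

    Simplified sketch: a real scheduler would also track VRAM,
    per-step sequence growth, and preemption.
    """
    admitted = []
    while queue and running_tokens + queue[0]["tokens"] <= budget:
        req = queue.popleft()
        running_tokens += req["tokens"]
        admitted.append(req["id"])
    return admitted

q = deque([{"id": 1, "tokens": 300}, {"id": 2, "tokens": 500}, {"id": 3, "tokens": 400}])
print(schedule(q, running_tokens=200, budget=1000))  # [1, 2]
```

Request 3 stays queued because admitting it would exceed the budget, i.e. the scheduler keeps occupancy as high as it can without overcommitting.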

> depending on implementation could be a superset of queue time.

Makes sense.

> KV cache during decoding which, a

Okay, this doesn't happen in TGI. Essentially vLLM is doing...

> Batch size and TPOT are available separately.

But they are bucketized individually, which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand...

> How do you intend to measure free compute?

Well, the scheduler knows everything (past tokens for each query, number of running queries, available VRAM). The theoretical max is known...
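Since the scheduler already holds all of this state, a first-order estimate of free compute can be sketched as the theoretical maximum minus what the running queries currently occupy (names and the single-budget model are assumptions for illustration):

```python
def free_compute(past_tokens: list[int], max_batch_total_tokens: int) -> int:
    """Estimate unused capacity as the theoretical max token budget
    minus the tokens held by currently running queries.

    Illustrative only: a real estimate would also account for VRAM
    fragmentation and the per-step growth of each sequence.
    """
    used = sum(past_tokens)
    return max(0, max_batch_total_tokens - used)

print(free_compute([512, 1024, 256], 16384))  # 14592
```

Exposing this as a gauge would let operators see at a glance how far the server is from its compute ceiling.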