Nicolas Patry

Results 977 comments of Nicolas Patry

Thanks for surfacing this. Discussing internally, we figured there are some security implications to this behavior, which we're most likely going to close, so this behavior will go away (and...

Do you mind sharing why consuming 2x memory is an issue for you? Adding context is likely to help others as well. In general for GPUs, the CPU RAM...

Hey, I had indeed been away from this crate for quite a long time, because I just didn't need it anymore.

> Simple library to listen and send events globally to...

Sorry, the issue obviously occurs in a vLLM from-source build, which is hard to debug for particular individual setups. We're ditching vLLM as a dependency anyway, so it should be...

Also, we're relying more and more on `nix` to speed up our build times and give us fewer headaches around builds. ``` # Install cuda (system-wide, no conda otherwise...

| Metric Name | Type | Unit | Implemented by TGI Already |
| ----------- | ---- | ---- | -------------------------- |
| model_load_time | Counter | Seconds | | ...
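As a sketch of how a metric like `model_load_time` could be recorded, here is a minimal plain-stdlib stand-in (the registry and function names are hypothetical; a real server such as TGI would export this through a proper metrics library like a Prometheus client):

```python
import time

# Hypothetical flat metrics registry; a real deployment would use
# typed Counter/Histogram objects from a metrics client library.
metrics: dict[str, float] = {}

def record_model_load(load_fn):
    """Time a model-load callable and record the duration in seconds."""
    start = time.perf_counter()
    result = load_fn()
    metrics["model_load_time_seconds"] = time.perf_counter() - start
    return result

# Stand-in loader; real code would load model weights here.
model = record_model_load(lambda: "weights")
print(metrics["model_load_time_seconds"] >= 0.0)
```

The unit (seconds) is carried in the metric name, which mirrors the common Prometheus naming convention.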

By the way, on the topic of monitoring, we're slowly but surely moving to a different scheduling mechanism whose goal is to maximize compute occupancy. https://github.com/huggingface/text-generation-inference/pull/1940 Basically we might not...
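The occupancy-maximizing idea can be sketched as greedy admission under a token budget. This is purely illustrative (the request shape, the budget model, and the function name are assumptions, not TGI's actual scheduler):

```python
from collections import deque

def schedule(queue: deque, running_tokens: int, budget: int) -> list:
    """Greedily admit queued requests while their estimated token
    footprint still fits under the total token budget.

    Simplified sketch: a real scheduler would also track VRAM,
    per-step sequence growth, and preemption.
    """
    admitted = []
    while queue and running_tokens + queue[0]["tokens"] <= budget:
        req = queue.popleft()
        running_tokens += req["tokens"]
        admitted.append(req["id"])
    return admitted

q = deque([{"id": 1, "tokens": 300}, {"id": 2, "tokens": 500}, {"id": 3, "tokens": 400}])
print(schedule(q, running_tokens=200, budget=1000))  # [1, 2]
```

Request 3 stays queued because admitting it would exceed the budget, i.e. the scheduler keeps occupancy as high as it can without overcommitting.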

> depending on implementation could be a superset of queue time.

Makes sense.

> KV cache during decoding which, a

Okay, this doesn't happen in TGI. Essentially vLLM is doing...

> Batch size and TPOT are available separately.

But they are bucketized individually, which makes deriving TPOT per batch size infeasible. The reason to have this info is to understand...

> How do you intend to measure free compute?

Well, the scheduler knows everything (past tokens for each query, number of running queries, available VRAM). The theoretical max is known...
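Since the scheduler already holds all of this state, a first-order estimate of free compute can be sketched as the theoretical maximum minus what the running queries currently occupy (names and the single-budget model are assumptions for illustration):

```python
def free_compute(past_tokens: list[int], max_batch_total_tokens: int) -> int:
    """Estimate unused capacity as the theoretical max token budget
    minus the tokens held by currently running queries.

    Illustrative only: a real estimate would also account for VRAM
    fragmentation and the per-step growth of each sequence.
    """
    used = sum(past_tokens)
    return max(0, max_batch_total_tokens - used)

print(free_compute([512, 1024, 256], 16384))  # 14592
```

Exposing this as a gauge would let operators see at a glance how far the server is from its compute ceiling.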