/health endpoint does not really provide insight into healthiness
System Info
latest, any platform
Information
- [ ] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
Reproduction
- check the code: https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/infinity_server.py#L173
- query `/health` (see the sketch below)
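For reference, a minimal sketch of the query step, assuming the server is reachable on Infinity's default port 7997 (the base URL is a placeholder for your deployment, not part of the report):

```python
# Minimal reproduction sketch: query /health on a running Infinity container.
# Assumption: default port 7997; adjust BASE_URL to your deployment.
import requests

BASE_URL = "http://localhost:7997"  # hypothetical address of the container

resp = requests.get(f"{BASE_URL}/health", timeout=5)
print(resp.status_code, resp.text)  # still 200 even after the OOM described below
```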
Fair - what would you suggest as a better health endpoint? Background: it's a route that is always up for the Kubernetes / uptime health check. It proves that FastAPI is ready to handle requests and has indeed started (all models were loaded without error). If e.g. asyncio got deadlocked, it would no longer respond. Beyond that, you can measure the latency of the response.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
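To illustrate the latency point, a small hedged sketch that times a handful of `/health` requests (the base URL is the same assumption as in the reproduction snippet):

```python
# Sketch: measure /health response latency over a few requests.
# Assumption: Infinity reachable at the placeholder BASE_URL below.
import time
import requests

BASE_URL = "http://localhost:7997"

latencies = []
for _ in range(10):
    start = time.perf_counter()
    requests.get(f"{BASE_URL}/health", timeout=5).raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"avg /health latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```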
I agree FastAPI is able to handle requests, but we had an out-of-memory issue and it is not recognized by the health endpoint. The /metrics and /health endpoints are still serving, but the embedding endpoint is dead after an OOM.
If the health endpoint checked the healthiness of inference, the pod would be restarted (see the probe sketch after the log below).
ERROR 2024-11-28 13:24:44,248 infinity_emb ERROR: CUDA out of memory. Tried to allocate 3.88 GiB. GPU 0 has a total capacity of 44.42 GiB of which 920.81 MiB is free. Process 247326 has 5.95 GiB memory in use. Process 248104 has 37.55 GiB memory in use. Of the allocated memory 5.44 GiB is allocated by PyTorch, and 6.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) batch_handler.py:574
Traceback (most recent call last):
  File "/app/infinity_emb/inference/batch_handler.py", line 563, in _core_batch
    embed = self._model.encode_core(feat)
  File "/app/infinity_emb/transformer/embedder/sentence_transformer.py", line 117, in encode_core
    out: dict[str, "Tensor"] = self.forward(features)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 688, in forward
    input = module(input, **module_kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 350, in forward
    output_states = self.auto_model(**trans_features, **kwargs, return_dict=False)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 943, in forward
    extended_attention_mask = _prepare_4d_attention_mask_for_sdpa(
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 447, in _prepare_4d_attention_mask_for_sdpa
    return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 186, in _expand_mask
    inverted_mask = 1.0 - expanded_mask
  File "/app/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 41, in wrapped
    return f(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 962, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.88 GiB. GPU 0 has a total capacity of 44.42 GiB of which 920.81 MiB is free. Process 247326 has 5.95 GiB memory in use. Process 248104 has 37.55 GiB memory in use. Of the allocated memory 5.44 GiB is allocated by PyTorch, and 6.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO: 172.18.0.6:60254 - "GET /metrics HTTP/1.0" 200 OK
INFO: 172.18.0.6:38224 - "GET /metrics HTTP/1.0" 200 OK
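One way to get the restart behaviour without changing `/health` itself is a probe that exercises the actual inference path. Below is a minimal sketch, not an official Infinity feature: it assumes the OpenAI-compatible `/embeddings` route, and the base URL and model id are placeholders for your deployment.

```python
# Sketch of a deeper liveness check: send one tiny embedding request and exit
# non-zero if it fails or times out. Wired into a Kubernetes exec liveness probe
# (or run as an external watchdog), a failing check makes the kubelet restart the
# pod, which a /health GET alone does not trigger after a CUDA OOM.
import sys
import requests

BASE_URL = "http://localhost:7997"   # hypothetical: address of the Infinity pod
MODEL = "your-model-id"              # hypothetical: the model id served by this pod

try:
    resp = requests.post(
        f"{BASE_URL}/embeddings",
        json={"model": MODEL, "input": ["liveness probe"]},
        timeout=10,
    )
    resp.raise_for_status()
except requests.RequestException as exc:
    print(f"inference probe failed: {exc}", file=sys.stderr)
    sys.exit(1)  # non-zero exit signals an unhealthy pod to the exec probe

sys.exit(0)
```

A probe like this costs one small forward pass per check interval, so it should be run less aggressively than the plain `/health` HTTP probe.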