/health endpoint does not really provide insight into healthiness
System Info
latest, any platform
Information
- [ ] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [ ] An officially supported CLI command
- [ ] My own modifications
Reproduction
- check the code: https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/infinity_server.py#L173
- query `/health` (see the sketch below)
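For reference, a minimal sketch of the query step, assuming the server is reachable on Infinity's default port 7997 (the base URL is a placeholder for your deployment, not part of the report):

```python
# Minimal reproduction sketch: query /health on a running Infinity container.
# Assumption: default port 7997; adjust BASE_URL to your deployment.
import requests

BASE_URL = "http://localhost:7997"  # hypothetical address of the container

resp = requests.get(f"{BASE_URL}/health", timeout=5)
print(resp.status_code, resp.text)  # still 200 even after the OOM described below
```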
Fair - what would you suggest as a better health endpoint? Background: it's a route that is always up for the Kubernetes / uptime health check. It proves that FastAPI is ready to handle requests and has indeed started (all models were loaded without error). If e.g. asyncio got deadlocked, it would no longer respond. Beyond that, you can measure the latency of the response.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
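To illustrate the latency point, a small hedged sketch that times a handful of `/health` requests (the base URL is the same assumption as in the reproduction snippet):

```python
# Sketch: measure /health response latency over a few requests.
# Assumption: Infinity reachable at the placeholder BASE_URL below.
import time
import requests

BASE_URL = "http://localhost:7997"

latencies = []
for _ in range(10):
    start = time.perf_counter()
    requests.get(f"{BASE_URL}/health", timeout=5).raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"avg /health latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```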
I agree FastAPI is able to handle requests, but we had an out-of-memory issue and it is not recognized by the health endpoint. The /metrics and /health endpoints are still serving, but the embedding endpoint is dead after an OOM.
If the health endpoint checked the healthiness of inference, the pod would be restarted (see the probe sketch after the log below).
ERROR 2024-11-28 13:24:44,248 infinity_emb ERROR: CUDA out of memory. Tried to allocate 3.88 GiB. GPU 0 has a total capacity of 44.42 GiB of which 920.81 MiB is free. Process 247326 has 5.95 GiB memory in use. Process 248104 has 37.55 GiB memory in use. Of the allocated memory 5.44 GiB is allocated by PyTorch, and 6.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) batch_handler.py:574
Traceback (most recent call last):
  File "/app/infinity_emb/inference/batch_handler.py", line 563, in _core_batch
    embed = self._model.encode_core(feat)
  File "/app/infinity_emb/transformer/embedder/sentence_transformer.py", line 117, in encode_core
    out: dict[str, "Tensor"] = self.forward(features)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 688, in forward
    input = module(input, **module_kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 350, in forward
    output_states = self.auto_model(**trans_features, **kwargs, return_dict=False)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 943, in forward
    extended_attention_mask = _prepare_4d_attention_mask_for_sdpa(
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 447, in _prepare_4d_attention_mask_for_sdpa
    return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 186, in _expand_mask
    inverted_mask = 1.0 - expanded_mask
  File "/app/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 41, in wrapped
    return f(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 962, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.88 GiB. GPU 0 has a total capacity of 44.42 GiB of which 920.81 MiB is free. Process 247326 has 5.95 GiB memory in use. Process 248104 has 37.55 GiB memory in use. Of the allocated memory 5.44 GiB is allocated by PyTorch, and 6.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO: 172.18.0.6:60254 - "GET /metrics HTTP/1.0" 200 OK
INFO: 172.18.0.6:38224 - "GET /metrics HTTP/1.0" 200 OK
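One way to get the restart behaviour without changing `/health` itself is a probe that exercises the actual inference path. Below is a minimal sketch, not an official Infinity feature: it assumes the OpenAI-compatible `/embeddings` route, and the base URL and model id are placeholders for your deployment.

```python
# Sketch of a deeper liveness check: send one tiny embedding request and exit
# non-zero if it fails or times out. Wired into a Kubernetes exec liveness probe
# (or run as an external watchdog), a failing check makes the kubelet restart the
# pod, which a /health GET alone does not trigger after a CUDA OOM.
import sys
import requests

BASE_URL = "http://localhost:7997"   # hypothetical: address of the Infinity pod
MODEL = "your-model-id"              # hypothetical: the model id served by this pod

try:
    resp = requests.post(
        f"{BASE_URL}/embeddings",
        json={"model": MODEL, "input": ["liveness probe"]},
        timeout=10,
    )
    resp.raise_for_status()
except requests.RequestException as exc:
    print(f"inference probe failed: {exc}", file=sys.stderr)
    sys.exit(1)  # non-zero exit signals an unhealthy pod to the exec probe

sys.exit(0)
```

A probe like this costs one small forward pass per check interval, so it should be run less aggressively than the plain `/health` HTTP probe.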