Should we switch to a different pod if model adapter load fails x times?

Open varungup90 opened this issue 4 months ago • 4 comments

Scenario 1

Model adapter load keeps failing indefinitely

I injected an error into the model load path. As expected, the model does not load. The problem is that the model adapter records the pod it is supposed to load onto and then never switches to a different pod to retry, even though the failure could be a pod-specific issue.
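To make the question concrete, here is a minimal sketch of the behavior being asked about: count load failures per pod and move to the next candidate once a pod reaches a limit. This is not the AIBrix controller code. `AdapterPlacement`, `load_with_failover`, `max_failures_per_pod`, and the `try_load` callback are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence


@dataclass
class AdapterPlacement:
    """Hypothetical per-adapter bookkeeping; not an AIBrix type."""
    max_failures_per_pod: int = 3  # the "x" from the issue title (assumed value)
    failures: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, pod: str) -> None:
        # Increment the failure count for this pod.
        self.failures[pod] = self.failures.get(pod, 0) + 1

    def next_pod(self, candidates: Sequence[str]) -> Optional[str]:
        # Return the first candidate still under the failure limit, else None.
        for pod in candidates:
            if self.failures.get(pod, 0) < self.max_failures_per_pod:
                return pod
        return None


def load_with_failover(
    candidates: Sequence[str],
    try_load: Callable[[str], bool],
    placement: AdapterPlacement,
) -> Optional[str]:
    """Try pods in order; leave a pod once it has failed the allowed number of times."""
    while True:
        pod = placement.next_pod(candidates)
        if pod is None:
            # Every candidate hit the limit, so the failure is probably not pod-specific.
            return None
        if try_load(pod):
            return pod
        placement.record_failure(pod)
```

A real controller would presumably back off between attempts rather than loop tightly. Returning `None` when every pod is exhausted also gives a way to distinguish a bad pod from a bad image or adapter.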

Scenario 2

I deleted the first pod, where the LoRA adapter load was stuck failing. The LoRA adapter load then moves to the next pod, as expected. It fails on the second pod as well, because the image is still bad: it has the hardcoded error in load_lora.

Scenario 3

I updated the image for all pods to one where the LoRA adapter load succeeds, then deleted all pods and recreated them. It takes several minutes, but the model adapter is rescheduled.

varungup90, Sep 27 '24 19:09