aibrix
Should we switch to a different pod if the model adapter load fails x times?
Scenario 1
Model adapter load fails indefinitely
I injected an error into the model load path. As expected, the model does not load. The problem is that the model adapter records the pod it is supposed to load on and then never switches to a different pod to retry, even though the failure could be specific to that pod.
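To make the proposal concrete, here is a minimal sketch of the behavior I have in mind: retry on the same pod up to a threshold, then move to the next candidate. This is not the actual AIBrix controller code. `schedule_adapter`, `try_load`, and `MAX_LOAD_ATTEMPTS` are hypothetical names used only for illustration, and the threshold of 3 is an arbitrary stand-in for "x".

```python
from typing import Callable, Dict, List, Optional

# Hypothetical threshold: the "x" in the issue title.
MAX_LOAD_ATTEMPTS = 3


def schedule_adapter(
    pods: List[str],
    try_load: Callable[[str], bool],
    max_attempts: int = MAX_LOAD_ATTEMPTS,
) -> Optional[str]:
    """Try each candidate pod up to max_attempts times.

    Returns the name of the first pod where the load succeeds,
    or None if every candidate exhausted its attempts.
    """
    failures: Dict[str, int] = {}
    for pod in pods:
        while failures.get(pod, 0) < max_attempts:
            if try_load(pod):
                return pod
            failures[pod] = failures.get(pod, 0) + 1
        # Threshold reached on this pod: fall through to the next candidate.
    # Every pod failed (for example, a bad image everywhere): report it
    # instead of retrying forever.
    return None
```

With this policy, a pod-specific failure would be worked around automatically, while a failure on every pod would surface as an error instead of an endless retry on one pod.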
Scenario 2
I deleted the first pod, on which the LoRA adapter load was stuck failing. The LoRA adapter load then moved to the next pod, as expected. It fails on the second pod as well, because the image is still bad: it has a hardcoded error in load_lora.
Scenario 3
I updated the image for all pods to one where the LoRA adapter load succeeds, then deleted and recreated all the pods. It takes several minutes, but the model adapter is rescheduled.