skypilot
[Serve][UX] Fine-grained reason for a replica failure
Currently, our recommendation to users is to check the logs if a replica fails, but failures might be easier to understand and debug if we clearly state the reason in the sky status output.
Potential solutions:
- More replica status, e.g. RUN_FAILED, SETUP_FAILED;
- Add a column to indicate failure reason in a plain text string.
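To make the two options concrete, here is a minimal sketch of how they might combine. It is not SkyPilot's actual code: the names ReplicaStatus, ReplicaRow, failure_reason, and format_rows are all hypothetical, and the column layout is only an assumption about what a status table could look like.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ReplicaStatus(Enum):
    """Hypothetical statuses; only the two failure names come from the issue."""
    READY = "READY"
    SETUP_FAILED = "SETUP_FAILED"
    RUN_FAILED = "RUN_FAILED"


@dataclass
class ReplicaRow:
    """One row of a hypothetical status table."""
    replica_id: int
    status: ReplicaStatus
    # Free-form, human-readable explanation; None when the replica is healthy.
    failure_reason: Optional[str] = None


def format_rows(rows: List[ReplicaRow]) -> str:
    """Render replicas as a plain-text table with an extra REASON column."""
    lines = [f"{'ID':<4}{'STATUS':<14}REASON"]
    for row in rows:
        reason = row.failure_reason or "-"
        lines.append(f"{row.replica_id:<4}{row.status.value:<14}{reason}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_rows([
        ReplicaRow(1, ReplicaStatus.READY),
        ReplicaRow(2, ReplicaStatus.SETUP_FAILED, "setup command exited non-zero"),
    ]))
```

The finer-grained status says which phase failed, while the free-text column carries detail that would not fit in an enum.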
You can run sky serve logs <service_name> <replica_id> to see the logs of the failing replica, but I agree that SkyPilot should have a status for a failed build, or should stop creating new replicas once a maximum number of attempts has been made.
Already resolved by #3411. Closing now