pai
Reserve container after job failed.
In protocol v2, we removed the debug mode. Users can no longer reserve the container after the job has finished (failed or succeeded). We need to decide whether to add this feature back.
Several things need to be considered:
- Whether remote development can satisfy the user requirement
- How to prevent the feature from consuming too many resources once it is enabled
refer to: https://github.com/microsoft/pai/issues/3598
Add my vote for adding it back.
In the previous version, when a container was reserved for debugging, the job stayed in RUNNING status. This caused a lot of confusion for admins and users. To add this feature back, we should expose more runtime lifecycle statuses, so that other services can use this information and show an accurate status to the end user.
The runtime status currently contains:
- Executing pre command
- Executing user command
- Executing post command
- Collecting exit info
And we should add a "Reserved for debugging" status.
The status will be written to a predefined file, such as /pai/runtime-status. On each status change, the runtime will update the file. The file format can be:
status:
  - name: statusName
    startTimestamp: status start timestamp
    endTimestamp: status end timestamp
    message: any string
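To make the proposal concrete, here is a minimal Python sketch of how a runtime might maintain such a file. It is only an illustration, not the actual runtime code: the function names (`enter_status`, `render`, `write_status_file`) are hypothetical, timestamps are assumed to be Unix epoch seconds, and an open status is assumed to carry `endTimestamp: null`. Writing through a temporary file and renaming is a design choice added here so that readers never see a half-written file.

```python
import json
import os
import tempfile
import time


def enter_status(history, name, message="", now=None):
    """Close the currently open status (if any) and append a new one."""
    now = int(time.time()) if now is None else now
    if history and history[-1]["endTimestamp"] is None:
        history[-1]["endTimestamp"] = now
    history.append({
        "name": name,
        "startTimestamp": now,
        "endTimestamp": None,  # assumption: null until the status ends
        "message": message,
    })
    return history


def render(history):
    """Render the history in the proposed file format."""
    lines = ["status:"]
    for entry in history:
        end = "null" if entry["endTimestamp"] is None else entry["endTimestamp"]
        lines.append(f"  - name: {entry['name']}")
        lines.append(f"    startTimestamp: {entry['startTimestamp']}")
        lines.append(f"    endTimestamp: {end}")
        # json.dumps yields a double-quoted string, which is also valid YAML.
        lines.append(f"    message: {json.dumps(entry['message'])}")
    return "\n".join(lines) + "\n"


def write_status_file(path, history):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(render(history))
    os.replace(tmp_path, path)
```

For example, calling `enter_status(history, "Executing user command")` and later `enter_status(history, "Reserved for debugging", message="job failed")` closes the first entry and opens the second. Another service could then read the last entry in the file to tell a reserved container apart from a job that is still running.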
@Binyang2014 @scarlett2018 Is this feature ready?