
Reserve container after job failed.

Open Binyang2014 opened this issue 5 years ago • 3 comments

In protocol v2, we removed the debug mode. Users cannot reserve the container after the job has finished (failed or succeeded). We need to decide whether to add this feature back.

Several things to consider:

  1. Whether remote development can satisfy user requirements
  2. How to prevent excessive resource consumption once this feature is enabled

refer to: https://github.com/microsoft/pai/issues/3598

Binyang2014 avatar Sep 17 '19 08:09 Binyang2014

Add my vote for adding it back.

scarlett2018 avatar Sep 20 '19 02:09 scarlett2018

In the previous version, when a container was reserved for debugging, the job remained in RUNNING status. This caused a lot of confusion for admins and users. To add this feature back, we should expose more runtime lifecycle statuses, so that other services can use this information to show an accurate status to end users.

The runtime status currently contains:

  1. Executing pre command
  2. Executing user command
  3. Executing post command
  4. Collecting exit info

We should also add a "Reserved for debugging" status.

The status will be written to a predefined file, such as /pai/runtime-status. On each status change, the runtime will update the file. The file format can be:

status:
- name: statusName
  startTimestamp: status start timestamp
  endTimestamp: status end timestamp
  message: any string
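
As a rough illustration of the file format above, here is a minimal Python sketch of how a runtime might record status changes. It is not the actual OpenPAI runtime code: the names `StatusEntry`, `enter_status`, `render`, and `write_status` are hypothetical, timestamps are assumed to be Unix seconds, and an unfinished status is assumed to have a `null` endTimestamp. The demo writes to a temporary directory rather than /pai/runtime-status.

```python
import json
import os
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class StatusEntry:
    name: str
    start_timestamp: int                 # assumption: Unix seconds
    end_timestamp: Optional[int] = None  # None while the status is active
    message: str = ""


def enter_status(entries: List[StatusEntry], name: str,
                 now: Optional[int] = None, message: str = "") -> None:
    """Close the currently active status (if any) and append the new one."""
    now = int(time.time()) if now is None else now
    if entries and entries[-1].end_timestamp is None:
        entries[-1].end_timestamp = now
    entries.append(StatusEntry(name, now, None, message))


def render(entries: List[StatusEntry]) -> str:
    """Render entries in the proposed layout; strings are JSON-quoted,
    which is also valid YAML double-quoted syntax."""
    lines = ["status:"]
    for e in entries:
        end = "null" if e.end_timestamp is None else str(e.end_timestamp)
        lines += [
            f"- name: {json.dumps(e.name)}",
            f"  startTimestamp: {e.start_timestamp}",
            f"  endTimestamp: {end}",
            f"  message: {json.dumps(e.message)}",
        ]
    return "\n".join(lines) + "\n"


def write_status(path: Path, entries: List[StatusEntry]) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(render(entries), encoding="utf-8")
    os.replace(tmp, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        status_file = Path(d) / "runtime-status"
        history: List[StatusEntry] = []
        for name in ("Executing user command", "Reserved for debugging"):
            enter_status(history, name)
            write_status(status_file, history)
        print(status_file.read_text(encoding="utf-8"))
```

With this layout, another service could treat the last entry with an empty endTimestamp as the current status, and so distinguish a reserved container from a job that is still running.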

Binyang2014 avatar Oct 25 '19 07:10 Binyang2014

@Binyang2014 @scarlett2018 Is this feature ready?

edenbuaa avatar Dec 09 '21 04:12 edenbuaa