Binyang Li

Results 16 issues of Binyang Li

In our case, some containers write data to the GPU to run ML jobs, while others only monitor GPU resources. We want to give only read permission...

node10 is already deprecated, and newer versions of many packages do not support it. We need to upgrade to node14 or higher in order to upgrade those packages. Upgrading the node version will break the original...

pai-dev

In protocol v2, we removed the `debug mode`. Users cannot reserve the container after the job has finished (failed or succeeded). We need to think about whether we need to...

enhancement
pai-dev

Reported by a customer: a submitted job always stays in waiting status. Checking the framework status shows that the framework is completed, so the cause of this issue is that the framework watcher did not sync the...

We noticed that some jobs fail with exit code -1203. Some related error logs: ``` error while creating mount source path '/var/lib/kubelet/pods/...: file exists ``` And for the docker daemon, the...

Fix MS Graph API failure

Users failed to log in on the prod bed. This was caused by the MS Graph group API failing and returning 400. We found that this API can return success after...

In some situations, a user job hangs and we need to attach to the process to debug the issue. Currently, inside a container, we cannot use gdb to attach to the process due to...

We already have an alert for this issue, but it does not cover all situations. - [ ] Add case: GPU perf in P0 status, but application clock not correct - [ ]...

pai-dev