openpai-runtime
Runtime for deep learning workload
openpai/openpai-runtime: barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
root@pai-worker1:/etc/kubernetes# docker logs b441c50e30fa
+ CHILD_PROCESS=UNKNOWN
+ trap exit_handler EXIT
+ PAI_WORK_DIR=/usr/local/pai
+ PAI_CONFIG_DIR=/usr/local/pai-config
+ PAI_INIT_DIR=/usr/local/pai/init.d
+ PAI_RUNTIME_DIR=/usr/local/pai/runtime.d
+ PAI_LOG_DIR=/usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df
+ PAI_SECRET_DIR=/usr/local/pai/secrets
+ PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets
+ PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets
+ chmod a+rw /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df...
When running a large-scale job, cloning code can fail with `requested URL returned error: 429`. The git plugin needs to be enhanced to handle this case and let the task always retry.
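One way the retry behaviour could look is sketched below. This is not the git plugin's actual code: the function names (`is_rate_limited`, `backoff_delay`, `clone_with_retry`), the backoff parameters, and the choice of exponential backoff with jitter are all assumptions made for illustration. The only detail taken from the report is the `returned error: 429` text that git prints when the server rate-limits the request.

```python
import random
import subprocess
import time


def is_rate_limited(stderr: str) -> bool:
    """Return True when git's stderr reports an HTTP 429 response."""
    return "returned error: 429" in stderr


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff in seconds, capped at `cap` (attempt starts at 0).

    The base and cap values are illustrative assumptions, not project settings.
    """
    return min(cap, base * (2 ** attempt))


def clone_with_retry(repo_url, dest, run=subprocess.run, sleep=time.sleep):
    """Clone `repo_url` into `dest`, retrying indefinitely on HTTP 429.

    Any other failure is raised immediately so that real errors (bad URL,
    missing credentials) are not hidden behind endless retries.
    Returns the number of attempts that were needed.
    `run` and `sleep` are injectable so the logic can be tested offline.
    """
    attempt = 0
    while True:
        result = run(
            ["git", "clone", repo_url, dest],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return attempt + 1
        if not is_rate_limited(result.stderr):
            raise RuntimeError(f"git clone failed: {result.stderr.strip()}")
        # Add random jitter so many tasks of one large job do not retry in lockstep.
        sleep(backoff_delay(attempt) + random.uniform(0.0, 1.0))
        attempt += 1
```

Spreading retries out with jitter matters for the large-scale case in particular: if every task retried on the same schedule, the retries would arrive at the git server together and likely trigger the rate limit again.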
# Current situation: Currently, openpai-runtime is tightly coupled with PAI and Framework Controller. The code has been split out, but some logic is still mixed. To use runtime, we need to use...
Currently, runtimeUnknownError is treated as a runtimeAbortExit error. We should distinguish these two errors.