openpai-runtime

Runtime for deep learning workload

Results 4 openpai-runtime issues

root@pai-worker1:/etc/kubernetes# docker logs b441c50e30fa
+ CHILD_PROCESS=UNKNOWN
+ trap exit_handler EXIT
+ PAI_WORK_DIR=/usr/local/pai
+ PAI_CONFIG_DIR=/usr/local/pai-config
+ PAI_INIT_DIR=/usr/local/pai/init.d
+ PAI_RUNTIME_DIR=/usr/local/pai/runtime.d
+ PAI_LOG_DIR=/usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df
+ PAI_SECRET_DIR=/usr/local/pai/secrets
+ PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets
+ PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets
+ chmod a+rw /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df...

When running a large-scale job, cloning code can fail with `requested URL returned error: 429`. The git plugin needs to be enhanced to handle this case and let the task always retry.
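A minimal sketch of the retry behavior this issue asks for, assuming the plugin shells out to `git clone` and can inspect its stderr. The names `clone_with_retry`, `run_git_clone`, `base_delay`, `max_delay`, and `max_attempts` are hypothetical and are not taken from the openpai-runtime code base; the exponential backoff is one possible policy, not the one the project adopted.

```python
import subprocess
import time

# Substring of the git error quoted in the issue (HTTP 429 = Too Many Requests).
RATE_LIMIT_MARKER = "requested URL returned error: 429"


def run_git_clone(repo_url, dest):
    """Run `git clone` and return (return_code, stderr_text)."""
    proc = subprocess.run(
        ["git", "clone", repo_url, dest],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


def clone_with_retry(repo_url, dest, clone=run_git_clone, sleep=time.sleep,
                     base_delay=5.0, max_delay=300.0, max_attempts=None):
    """Clone a repository, retrying with exponential backoff on HTTP 429.

    Hypothetical helper: with max_attempts=None it keeps retrying for as long
    as the server answers 429. Any other failure is raised immediately.
    Returns the number of attempts that were needed.
    """
    attempt = 0
    while True:
        attempt += 1
        code, stderr = clone(repo_url, dest)
        if code == 0:
            return attempt
        if RATE_LIMIT_MARKER not in stderr:
            # Not a rate-limit failure, so retrying is unlikely to help.
            raise RuntimeError("git clone failed: " + stderr.strip())
        if max_attempts is not None and attempt >= max_attempts:
            raise RuntimeError(
                "git clone still rate limited after %d attempts" % attempt)
        # Wait base_delay, 2*base_delay, 4*base_delay, ... capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

The `clone` and `sleep` parameters are injectable so the backoff logic can be exercised without network access. In a large job, adding random jitter to each delay would likely help keep many tasks from retrying at the same moment.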

# Current situation: Currently, openpai-runtime is tightly coupled with PAI and Framework Controller. The code has been split out, but some of the logic is still mixed together. To use runtime, we need to use...

Currently, runtimeUnkonwnError will be treated as runtimeAbortExit error. We should distinguish these two errors.