poetryben888
Error log from the openpai/openpai-runtime container on the pai-worker node: ``` root@pai-worker1:/usr/local# docker logs 6e354aae0a40 + CHILD_PROCESS=UNKNOWN + trap exit_handler EXIT + PAI_WORK_DIR=/usr/local/pai + PAI_CONFIG_DIR=/usr/local/pai-config + PAI_INIT_DIR=/usr/local/pai/init.d + PAI_RUNTIME_DIR=/usr/local/pai/runtime.d + PAI_LOG_DIR=/usr/local/pai/logs/2e4061a5-244c-45af-8ede-b4b9f6604774 + PAI_SECRET_DIR=/usr/local/pai/secrets...
On the PAI node, running systemctl status kubelet shows: ● kubelet.service - Kubernetes Kubelet Server Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2021-11-04 16:02:04 CST; 7s ago...
My dev machine and master worker restarted, and afterwards k8s and OpenPAI did not start. What command should I run on the master worker to start the services (k8s and OpenPAI)?...
I don't know why the job does not run. stdout and stderr always show: Log folder can not be retrieved. I use the default template: hello-world-job.yaml
The cert-expiration-checker pod status is ImagePullBackOff; it seems the image has not been pulled, but the master worker already has the image. 1.  2. root@pai-master:/home# kubectl describe pod cert-expiration-checker-1634083200-zm8kn  3. ...
2021-10-11 09:35:40,888 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to clean all service's generated template file 2021-10-11 09:35:40,889 [INFO] - deployment.paiLibrary.paiService.service_template_clean : Begin to delete the generated template of marketplace-webportal's service. 2021-10-11...
1. **When I install OpenPAI and run "/bin/bash quick-start-service.sh", this error occurs:** _.......... internal-storage-create is not ready yet. Please wait for a moment! internal-storage-create is not ready yet. Please wait...
I ran it following your script, but it keeps failing and I can't find the cause. ``` root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2 python distributed_slurm_main.py --dist-file dist_file Traceback (most recent call last): File "distributed_slurm_main.py", line 420, in main() File "distributed_slurm_main.py", line 131, in main mp.spawn(main_worker,...
openpai/openpai-runtime: barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
root@pai-worker1:/etc/kubernetes# docker logs b441c50e30fa + CHILD_PROCESS=UNKNOWN + trap exit_handler EXIT + PAI_WORK_DIR=/usr/local/pai + PAI_CONFIG_DIR=/usr/local/pai-config + PAI_INIT_DIR=/usr/local/pai/init.d + PAI_RUNTIME_DIR=/usr/local/pai/runtime.d + PAI_LOG_DIR=/usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df + PAI_SECRET_DIR=/usr/local/pai/secrets + PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets + PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets + chmod a+rw /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df...