poetryben888

Results: 9 issues by poetryben888

openpai/openpai-runtime container on pai-work node error log: ``` root@pai-worker1:/usr/local# docker logs 6e354aae0a40 + CHILD_PROCESS=UNKNOWN + trap exit_handler EXIT + PAI_WORK_DIR=/usr/local/pai + PAI_CONFIG_DIR=/usr/local/pai-config + PAI_INIT_DIR=/usr/local/pai/init.d + PAI_RUNTIME_DIR=/usr/local/pai/runtime.d + PAI_LOG_DIR=/usr/local/pai/logs/2e4061a5-244c-45af-8ede-b4b9f6604774 + PAI_SECRET_DIR=/usr/local/pai/secrets...

On the pai node, run: systemctl status kubelet ● kubelet.service - Kubernetes Kubelet Server Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2021-11-04 16:02:04 CST; 7s ago...

My dev machine and master worker restarted, and afterwards k8s and OpenPAI did not start. What command should I run on the master worker to start the services (k8s and OpenPAI)?...

I don't know why the job does not run. stdout and stderr always show: "Log folder can not be retrieved". I use the default template: hello-world-job.yaml ![image](https://user-images.githubusercontent.com/15098245/137296415-6d8f1967-98d1-44b1-adb7-303ed046898c.png) ![image](https://user-images.githubusercontent.com/15098245/137296379-e0baee6f-8dae-46df-b6f1-0e187b79d55a.png)

The cert-expiration-checker pod status is ImagePullBackOff; it seems the image has not been pulled, but the master worker already has the images. 1. ![image](https://user-images.githubusercontent.com/15098245/137111998-5d9452d8-c392-438d-b47f-aeec0f793903.png) 2. root@pai-master:/home# kubectl describe pod cert-expiration-checker-1634083200-zm8kn ![image](https://user-images.githubusercontent.com/15098245/137112188-a00d351c-d3c5-44d7-a310-1955f969c656.png) 3. ![image](https://user-images.githubusercontent.com/15098245/137112257-c35d0f24-30d9-482e-b46b-d84babbf60e9.png)...

2021-10-11 09:35:40,888 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to clean all service's generated template file 2021-10-11 09:35:40,889 [INFO] - deployment.paiLibrary.paiService.service_template_clean : Begin to delete the generated template of marketplace-webportal's service. 2021-10-11...

1. **When I install OpenPAI and run "/bin/bash quick-start-service.sh", the following errors occur:** _.......... internal-storage-create is not ready yet. Please wait for a moment! internal-storage-create is not ready yet. Please wait...

I ran your script as instructed, but it keeps failing and I cannot find the cause. ``` root@pai-worker1:/home/Data/exports/pytorch-distributed# srun -N1 -n2 --gres gpu:2 python distributed_slurm_main.py --dist-file dist_file Traceback (most recent call last): File "distributed_slurm_main.py", line 420, in main() File "distributed_slurm_main.py", line 131, in main mp.spawn(main_worker,...

root@pai-worker1:/etc/kubernetes# docker logs b441c50e30fa + CHILD_PROCESS=UNKNOWN + trap exit_handler EXIT + PAI_WORK_DIR=/usr/local/pai + PAI_CONFIG_DIR=/usr/local/pai-config + PAI_INIT_DIR=/usr/local/pai/init.d + PAI_RUNTIME_DIR=/usr/local/pai/runtime.d + PAI_LOG_DIR=/usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df + PAI_SECRET_DIR=/usr/local/pai/secrets + PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets + PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets + chmod a+rw /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df...