arena
"arene top job" couldn't detect metrics.
I followed the guide arena/docs/userguide/9-top-job-gpu-metric.md.
Everything works as expected until the last step: when I submit the tfjob and use "arena top job" to check it, the result looks like this:
```
ERRO[0000] gpu metric is not exist in prometheus for query {__name__=~"nvidia_gpu_duty_cycle|nvidia_gpu_memory_used_bytes|nvidia_gpu_memory_total_bytes", pod_name=~""}
INSTANCE NAME  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)  STATUS  NODE
```
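Before digging into arena itself, it can help to query Prometheus directly for the metric names shown in the error above. A minimal sketch, where PROM_ADDR is just a placeholder for your Prometheus address:

```sh
# Ask Prometheus whether the GPU metrics exist at all.
# PROM_ADDR is a placeholder; replace it with your Prometheus address.
curl -s "http://PROM_ADDR/api/v1/query" --data-urlencode 'query=nvidia_gpu_duty_cycle'

# An empty "result" array means the exporter is not publishing the metric;
# in my case that turned out to be the closed 10255 port described below.
```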
Finally I found the causes:
- Kubeadm disables the read-only port 10255 by default since 1.11 (refer to "kubeadm: Improve the kubelet default configuration security-wise"), so cAdvisor couldn't collect metrics through port 10255. I fixed this by editing the kubelet config file /var/lib/kubelet/config.yaml, adding readOnlyPort: 10255, and restarting the kubelet with systemctl daemon-reload && systemctl restart kubelet (see the sketch after this list).
- After fixing the 1st problem, I found that the result of getting pods was empty; this is fixed by "Fix for top job metric and list" (#106).
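For completeness, this is the sequence I used for the 1st fix. It is a rough sketch assuming kubeadm's default config path; you may prefer to edit the YAML by hand instead of appending with sed, and note that the read-only port was disabled for security reasons, so this is a workaround rather than a recommended long-term setup.

```sh
# Append the read-only port to the kubelet config (kubeadm's default path)
sudo sed -i '$a readOnlyPort: 10255' /var/lib/kubelet/config.yaml

# Restart the kubelet so the new config is picked up
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify: the kubelet read-only API should now answer on 10255
curl -s http://127.0.0.1:10255/pods | head -c 200
```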
@cheyang
I think there should be a note about the 1st problem in the guide, because the current "Monitor GPUs of the training job" guide does not work with kubeadm 1.11 and later.
/assign cheyang
@xiaozhouX please take a look.
Thanks for your feedback!
In the GPU exporter, we call the node's port 10255 to get the node's GPU allocation for pods.
I'm thinking about replacing that with reading the device plugin's checkpoint file (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint); another way is reading the cgroup devices.list file (which is what cAdvisor did before).
These two approaches behave differently when there are hostIPC Pods.
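For anyone who wants to dig in, both sources can be inspected directly on a GPU node. This is only an exploratory sketch: the checkpoint file is plain JSON, but its exact field names depend on the Kubernetes version, and the cgroup path shown assumes cgroup v1.

```sh
# Inspect the device-plugin checkpoint (needs root; assumes jq is installed).
# Per-pod device assignments (e.g. nvidia.com/gpu IDs) are recorded here,
# which is the information the exporter currently gets via port 10255.
sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | jq .

# The cgroup alternative: each container's allowed devices are listed in its
# devices.list file under the devices cgroup hierarchy.
sudo find /sys/fs/cgroup/devices -name devices.list | head -n 3
```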
What's your suggestion? @cheyang @soolaugust
I am not familiar with these two approaches. Is there any reference? I want to dig into it.
So any progress here? Same issue here. @xiaozhouX
For now, we can only open the kubelet's 10255 port. We will solve this as soon as possible.