arena
"arene top job" couldn't detect metrics.
I followed the guide arena/docs/userguide/9-top-job-gpu-metric.md.
Everything works as expected until the last step: when I submit the tfjob and use "arena top job" to check it, the result looks like this:
```
ERRO[0000] gpu metric is not exist in prometheus for query {__name__=~"nvidia_gpu_duty_cycle|nvidia_gpu_memory_used_bytes|nvidia_gpu_memory_total_bytes", pod_name=~""}
INSTANCE NAME  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)  STATUS  NODE
```
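Before digging into arena itself, it can help to query Prometheus directly for the metric names shown in the error above. A minimal sketch, where PROM_ADDR is just a placeholder for your Prometheus address:

```sh
# Ask Prometheus whether the GPU metrics exist at all.
# PROM_ADDR is a placeholder; replace it with your Prometheus address.
curl -s "http://PROM_ADDR/api/v1/query" --data-urlencode 'query=nvidia_gpu_duty_cycle'

# An empty "result" array means the exporter is not publishing the metric;
# in my case that turned out to be the closed 10255 port described below.
```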
Finally I found the causes:
- Kubeadm disables the read-only port 10255 by default since 1.11 (refer to "kubeadm: Improve the kubelet default configuration security-wise"), so cAdvisor couldn't collect metrics through port 10255. I fixed this by editing the kubelet config file /var/lib/kubelet/config.yaml, adding readOnlyPort: 10255, and restarting the kubelet with systemctl daemon-reload && systemctl restart kubelet (see the sketch after this list).
- After fixing the 1st problem, I found that the result of getting pods was empty; this is fixed by "Fix for top job metric and list" (#106).
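For completeness, this is the sequence I used for the 1st fix. It is a rough sketch assuming kubeadm's default config path; you may prefer to edit the YAML by hand instead of appending with sed, and note that the read-only port was disabled for security reasons, so this is a workaround rather than a recommended long-term setup.

```sh
# Append the read-only port to the kubelet config (kubeadm's default path)
sudo sed -i '$a readOnlyPort: 10255' /var/lib/kubelet/config.yaml

# Restart the kubelet so the new config is picked up
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify: the kubelet read-only API should now answer on 10255
curl -s http://127.0.0.1:10255/pods | head -c 200
```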
@cheyang
I think there should be a note about the 1st problem in the guide, because the current "Monitor GPUs of the training job" guide does not work with kubeadm 1.11 and later.
/assign cheyang
@xiaozhouX please take a look.
Thanks for your feedback!
In the GPU exporter, we call the node's port 10255 to get the node's GPU allocation for pods.
I'm thinking about replacing that with reading the device plugin's checkpoint file (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint); another way is reading the cgroup devices.list file (which is what cAdvisor did before).
These two approaches behave differently when there are hostIPC Pods.
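For anyone who wants to dig in, both sources can be inspected directly on a GPU node. This is only an exploratory sketch: the checkpoint file is plain JSON, but its exact field names depend on the Kubernetes version, and the cgroup path shown assumes cgroup v1.

```sh
# Inspect the device-plugin checkpoint (needs root; assumes jq is installed).
# Per-pod device assignments (e.g. nvidia.com/gpu IDs) are recorded here,
# which is the information the exporter currently gets via port 10255.
sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | jq .

# The cgroup alternative: each container's allowed devices are listed in its
# devices.list file under the devices cgroup hierarchy.
sudo find /sys/fs/cgroup/devices -name devices.list | head -n 3
```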
What's your suggestion? @cheyang @soolaugust
I am not familiar with these two approaches. Is there any reference? I want to dig into it.
So any progress here? Same issue here. @xiaozhouX
For now, we can only open the kubelet's 10255 port. We will solve this as soon as possible.