start_gpu_data_collector.sh script fails when executed

Open upendart opened this issue 2 years ago • 1 comments

Hello,

I am trying to create a dashboard on Azure for GPU monitoring. I followed the steps mentioned and updated the fields in the start_gpu_data_collector.sh file with the required details. When I tried to execute the script, it threw the following error: gethostbyname("::1") failed.

I then updated the script and ran the collector as follows:

./gpu_data_collector.py -tis $INTERVAL_SECS -dfi $DCGM_FIELD_IDS > /tmp/gpu_data_collector.log

However, I still don't see any GPU Monitor custom fields created in the Log Analytics workspace.

upendart, May 12 '22 18:05

I am curious: what modification did you make to the script to overcome the gethostbyname error?

Are you using this GPU monitoring script to monitor the GPU activity of SLURM jobs? (This is the default behavior.) If so, no GPU monitoring data will be sent to Azure Monitor unless a SLURM job is running. If you would like all processes on the nodes to be monitored for GPU activity (even if they are not associated with a SLURM job) and the data sent to Azure Monitor, add the -fgm command line argument. I have made some corrections to the start-up and shutdown scripts (start_gpu_data_collector.sh, stop_gpu_data_collector.sh); see https://github.com/Azure/azurehpc/pull/584
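As a rough sketch of what adding -fgm looks like, the snippet below appends it to the -tis and -dfi arguments from the original command and prints the resulting command line for review. The values of INTERVAL_SECS and DCGM_FIELD_IDS (including the comma-separated format) are placeholders assumed for illustration; use whatever is already set in your start_gpu_data_collector.sh.

```shell
# Placeholder values (assumptions for illustration only); keep the values
# already configured in your start_gpu_data_collector.sh.
INTERVAL_SECS=10
DCGM_FIELD_IDS="203,252,1004"

# -tis and -dfi are taken from the original command; -fgm is the extra
# argument that asks for monitoring outside of SLURM jobs.
COLLECTOR_ARGS="-tis $INTERVAL_SECS -dfi $DCGM_FIELD_IDS -fgm"

# Print the full command line so it can be checked before running it.
echo "./gpu_data_collector.py $COLLECTOR_ARGS"
```

If the printed command looks right, run it from the directory that contains gpu_data_collector.py.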

If you still do not see any data being sent to Log Analytics (Custom Logs), please send me the stdout/stderr (/tmp/gpu_data_collector.log) produced when you execute the Python script.
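Note that a plain > redirect only captures stdout, so any error messages written to stderr would not end up in the log file. A minimal sketch of capturing both streams in one file follows; run_logged is a hypothetical helper name introduced here for illustration, not part of the azurehpc scripts.

```shell
# Hypothetical helper: run a command with stdout and stderr both
# written to the same log file.
run_logged() {
    log_file="$1"
    shift
    # "2>&1" sends stderr to wherever stdout is already going (the log file).
    "$@" > "$log_file" 2>&1
}

# Intended use, with the variables set as in start_gpu_data_collector.sh:
# run_logged /tmp/gpu_data_collector.log ./gpu_data_collector.py -tis "$INTERVAL_SECS" -dfi "$DCGM_FIELD_IDS"
```

The same effect can be had without the helper by appending 2>&1 to the original command after the > /tmp/gpu_data_collector.log redirect.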

garvct, May 17 '22 19:05