azurehpc
start_gpu_data_collector.sh script fails when executed
Hello,
I am trying to create a dashboard on Azure for GPU monitoring. I followed the documented steps and filled in the required details in the start_gpu_data_collector.sh file, but when I execute the script it throws the following error: gethostbyname("::1") failed.
I then updated the script and ran the collector as follows: ./gpu_data_collector.py -tis $INTERVAL_SECS -dfi $DCGM_FIELD_IDS > /tmp/gpu_data_collector.log. However, I don't see any GPU Monitor custom fields created in the Log Analytics Workspace.
I am curious what modification to the script did you make to overcome the gethostbyname error?
Are you using this GPU monitoring script to monitor the GPU activity of SLURM jobs? That is the default behavior, so if no SLURM job is running, no GPU monitoring data will be sent to Azure Monitor. If you would like GPU activity to be monitored for all processes on the nodes (even those not associated with a SLURM job) and the data sent to Azure Monitor, add the -fgm command line argument. I have made some corrections to the start-up and shutdown scripts (start_gpu_data_collector.sh, stop_gpu_data_collector.sh), see https://github.com/Azure/azurehpc/pull/584
If you still do not see any data being sent to Log Analytics (Custom logs), please send me the stdout/stderr (/tmp/gpu_data_collector.log) from running the Python script.
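For reference, here is a minimal sketch of how the invocation quoted earlier in this thread could be extended with -fgm while capturing stderr along with stdout (the original command redirected only stdout, so Python tracebacks would not reach the log file). The default values for INTERVAL_SECS and DCGM_FIELD_IDS are placeholders I am assuming for illustration, and the DRY_RUN switch is a hypothetical helper, not part of the azurehpc scripts; use the values already set in your start_gpu_data_collector.sh.

```shell
# Placeholder defaults (assumptions); keep the values from start_gpu_data_collector.sh.
INTERVAL_SECS="${INTERVAL_SECS:-10}"
DCGM_FIELD_IDS="${DCGM_FIELD_IDS:-203,252}"
LOG_FILE="${LOG_FILE:-/tmp/gpu_data_collector.log}"

# Hypothetical switch: 1 only prints the command, 0 actually runs the collector.
DRY_RUN="${DRY_RUN:-1}"

# -fgm asks the collector to report GPU activity even when no SLURM job is running.
COLLECTOR_ARGS="-tis $INTERVAL_SECS -dfi $DCGM_FIELD_IDS -fgm"

if [ "$DRY_RUN" = "1" ]; then
    echo "./gpu_data_collector.py $COLLECTOR_ARGS > $LOG_FILE 2>&1"
else
    # COLLECTOR_ARGS is left unquoted on purpose so it splits into separate arguments.
    # 2>&1 sends stderr to the same log file as stdout.
    ./gpu_data_collector.py $COLLECTOR_ARGS > "$LOG_FILE" 2>&1
fi
```

Run it once with DRY_RUN=1 to check the command line, then with DRY_RUN=0 from the directory containing gpu_data_collector.py, and attach the resulting /tmp/gpu_data_collector.log.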