Monitoring CPU, Memory and GPU
What is the current recommendation for monitoring the underlying CPU, Memory and GPU usage of a batch node? I want this information at a node level rather than at a job or task level. I have found batch-insights but this is now archived.
It looks like Application Insights is the officially recommended approach.
Batch Insights is no longer maintained, but it might still work for sending telemetry to Application Insights. It's unclear to me what purpose it served (perhaps it did what the Application Insights binaries now do?).
You can also browse in the Azure Portal to the Virtual machine scale set that Batch created, navigate to each individual node instance within it, and view Metrics for the instance.
Batch Explorer shows metrics for each node on the same graph (handy for eyeballing CPU usage of the whole pool), with the option to show an aggregate (single metric) for the pool as a whole.
No out of the box metrics? That seems kind of crazy that we can't get some baseline metrics without additional work.
Out of the box, you get default metrics for each individual node in the Azure Portal:
Batch → Features → Pools → (your pool) → General → Nodes → (pick a node) → Monitor
Or aggregates (e.g. average CPU across nodes) in Batch Explorer; would be nice if these were built into the Portal, too.
When I looked into this, I was using Microsoft-managed Batch nodes, so I could not access the scale set. The conclusion I came to was that if I wanted GPU metrics, I had to install a third-party monitoring agent into the compute node image.
@rossdakin At least I'm not seeing any metrics in the portal, even for individual nodes.
With Batch Explorer I'm able to see core minutes used, but still no metrics on CPU or memory. That leads to the conclusion that the only way to understand individual node performance is to develop your own lightweight .NET application and deploy it alongside your actual workload. Pretty bad experience, to be honest, as exposing those metrics from the underlying VMSS should not be that complicated...
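A "deploy your own collector" approach does not have to be large. Below is a minimal sketch in Python rather than .NET, assuming a Linux node (it reads `/proc/stat` and `/proc/meminfo`) and, for GPU figures, that `nvidia-smi` is on the PATH. The function names and the shape of the returned dictionary are made up for illustration, and where the sample gets sent (Application Insights, a log file, and so on) is left out.

```python
import json
import shutil
import subprocess
import time


def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # First 8 fields: user nice system idle iowait irq softirq steal.
            # Guest fields are skipped because they are already counted in user/nice.
            fields = [int(v) for v in line.split()[1:9]]
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            return idle, sum(fields)
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before, after):
    """CPU busy percentage between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def memory_percent(meminfo_text):
    """Used memory percentage based on MemTotal and MemAvailable (both in kB)."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    total = values["MemTotal"]
    return 100.0 * (total - values["MemAvailable"]) / total


def parse_gpu_csv(csv_text):
    """Parse 'utilization, memory used, memory total' rows, one row per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(v) for v in line.split(","))
        gpus.append({"util_percent": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def query_gpus():
    """Return per-GPU stats, or an empty list when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def sample_node(interval_seconds=1.0):
    """Take one node-level sample; CPU usage is averaged over the interval."""
    with open("/proc/stat") as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval_seconds)
    with open("/proc/stat") as f:
        after = parse_cpu_times(f.read())
    with open("/proc/meminfo") as f:
        mem = memory_percent(f.read())
    return {"cpu_percent": cpu_percent(before, after),
            "memory_percent": mem,
            "gpus": query_gpus()}


if __name__ == "__main__":
    print(json.dumps(sample_node()))
```

Something like this could be launched from the pool's start task so that it runs on every node regardless of which jobs are active, though that wiring is not shown here.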
I think you're not able to see metrics because your Batch account is set to Batch Service and not User Subscription. This article explains, for example, how to install the Azure Monitor extension through JSON at pool deployment: https://techcommunity.microsoft.com/blog/azurepaasblog/integrating-azure-monitor-in-azure-batch-to-monitor-batch-pool-nodes-performance/4428929
BUT it requires the Batch account to be set to User Subscription. It seems no one is able to properly monitor nodes in Batch Service mode, which is painful and is not mentioned in the Microsoft docs.
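For readers who have not seen the extension route, here is a rough sketch of what such a JSON fragment might look like, built in Python so it can be printed or merged into a pool template. The publisher and type names are the ones I understand the Azure Monitor Agent extension to use. The `typeHandlerVersion` value, the helper name and the exact placement under `virtualMachineConfiguration` are assumptions to check against the linked article, and the agent also needs a data collection rule, which is not covered here.

```python
import json


def azure_monitor_extension(os_family="linux", version="1.0"):
    """Build a VM extension entry for the Azure Monitor Agent.

    The default `version` is a placeholder assumption; check the current
    extension version before using it.
    """
    agent_type = {
        "linux": "AzureMonitorLinuxAgent",
        "windows": "AzureMonitorWindowsAgent",
    }[os_family]
    return {
        "name": agent_type,
        "publisher": "Microsoft.Azure.Monitor",
        "type": agent_type,
        "typeHandlerVersion": version,
        "autoUpgradeMinorVersion": True,
        "enableAutomaticUpgrade": True,
    }


# Assumed placement: the pool's virtualMachineConfiguration.extensions list.
pool_fragment = {
    "virtualMachineConfiguration": {
        "extensions": [azure_monitor_extension("linux")],
    }
}

if __name__ == "__main__":
    print(json.dumps(pool_fragment, indent=2))
```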
Can you clarify the best way to monitor nodes when the Batch account is set to Batch Service, please?
Out of the box, you get default metrics for each individual node in the Azure Portal:
Batch → Features → Pools → (your pool) → General → Nodes → (pick a node) → Monitor
Or aggregates (e.g. average CPU across nodes) in Batch Explorer; would be nice if these were built into the Portal, too.
I believe that is gone as soon as your job stops, though? Also, if I am running 1000 nodes, or even 10, checking each node by hand is not a real solution.