Michael Stuckey
Triage: For Helix, a big improvement would be to recycle machines for every build (the hosted pool does this).
I'd love to see telemetry added (probably to Grafana) that tracks throughput stats for a pool. Off the top of my head: instance count, queue length, and wait times.
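As a rough illustration of the stats mentioned above, here is a minimal Python sketch that rolls periodic pool samples up into a throughput summary. The `PoolSample` type, its field names, and `summarize_pool` are hypothetical and do not come from Helix or Grafana. How the samples would be collected and published is left out entirely.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PoolSample:
    """One hypothetical snapshot of a pool at a point in time."""
    instance_count: int          # machines currently in the pool
    queue_length: int            # work items waiting to be picked up
    wait_seconds: list[float]    # wait times of items dequeued since the last sample


def summarize_pool(samples: list[PoolSample]) -> dict[str, float]:
    """Aggregate snapshots into the three stats named in the comment."""
    waits = [w for s in samples for w in s.wait_seconds]
    return {
        "avg_instance_count": mean(s.instance_count for s in samples),
        "max_queue_length": max(s.queue_length for s in samples),
        "avg_wait_seconds": mean(waits) if waits else 0.0,
    }


# Made-up sample values, for illustration only.
samples = [PoolSample(4, 10, [30.0, 90.0]), PoolSample(6, 2, [60.0])]
summary = summarize_pool(samples)
```

A dashboard would more likely chart each of these as a time series than as a single rollup. The summary form here is only meant to pin down what would be measured.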
PR: [!52593](https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/52593)
This looks like an infrastructure problem. I see "No space left on device" errors in telemetry on the dnceng side. It's a known problem with azurelinux.3 and we are working...
Uptime for machines managed by Helix depends on the queue demand. The machines are kept alive until the load diminishes, then the scaleset scales down by removing the oldest machines...
> Am I reading the data correctly that the longest that particular machine

Not any single machine, this is the 50th, 75th and 95th percentile for the "uptime" of all...

> had it in my head that the average Helix machine had an uptime more like several days

This could be true for the very, very busy queues. I also...

> But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes?

Looks like that's...
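To make the percentile-versus-single-machine distinction concrete, here is a small Python sketch that summarizes uptime across a set of machines. It assumes a nearest-rank percentile definition, and the uptime values are made up. The actual query behind the telemetry is not shown in this thread and may compute percentiles differently.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def uptime_summary(uptimes_minutes: list[float]) -> dict[str, float]:
    """Summarize uptime over all machines; no entry describes one particular machine's history."""
    return {
        "p50": percentile(uptimes_minutes, 50),
        "p75": percentile(uptimes_minutes, 75),
        "p95": percentile(uptimes_minutes, 95),
        "max": max(uptimes_minutes),
    }


# Hypothetical uptimes in minutes for ten machines, for illustration only.
uptimes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 177]
summary = uptime_summary(uptimes)
```

With a fleet this small, the 95th percentile and the maximum land on the same machine. On a larger fleet they would usually differ, so a percentile chart alone does not show the longest uptime of any one machine.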
Closing as this has languished and doesn't seem to be blocking anything. Please reopen if that's incorrect.
PR: [Pull request 42721: Allow additional tags for managed storage accounts](https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/42721)