
Increasing visibility into the time builds & Helix tests take, and Helix errors

Open dotnet-bot opened this issue 2 years ago • 8 comments

Generally, our customers are dissatisfied when it comes to understanding build times and Helix test times. From speaking with the CI council and gathering feedback from our key partners, I've discovered the following problem areas:

  • Queue depth on its own isn't a helpful metric. What customers want to know is whether the current depth is normal for that queue.
  • Is this Helix failure due to the pool being overwhelmed or is there an infrastructure issue?
  • What's the rolling average wait time to get a build machine?
  • For a new PR, how long will it take for the entire CI pipeline to complete?
    • What's the percent chance of failure before AzDo times out?
  • How do queues look for your specific pipeline?
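
As a rough illustration of the first and third questions above, here is a minimal Python sketch, not an existing dnceng tool. The function names, the window size, and the "within two standard deviations of the queue's own history" rule are all assumptions made for the example; the real definition of "normal" would need to be agreed with customers.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_average(values, window):
    """Return, for each position, the mean of the most recent `window` values."""
    recent = deque(maxlen=window)  # old samples fall off automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(mean(recent))
    return averages


def is_depth_normal(current_depth, historical_depths, max_sigma=2.0):
    """Hypothetical rule: a depth is 'normal' if it is not more than
    `max_sigma` standard deviations above this queue's own historical mean.
    Only unusually deep queues are flagged; shallow ones always pass."""
    baseline = mean(historical_depths)
    spread = pstdev(historical_depths)
    return current_depth <= baseline + max_sigma * spread


# Assumed sample data: minutes waited for a build machine, and past depths
# recorded for a single queue.
wait_minutes = [10, 20, 30, 40]
past_depths = [10, 12, 8, 10]

print(rolling_average(wait_minutes, window=2))
print(is_depth_normal(12, past_depths))
print(is_depth_normal(40, past_depths))
```

Comparing each queue against its own history, rather than a global threshold, is what lets a depth of 40 be alarming for one queue and routine for another.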

Questions our customers would like answered, but where we need further clarification on what precisely they mean:

  • What's the expected average of runtime completion?
    • The entire pipeline? The Helix tests?
  • What is the status of machines?
    • What does status mean?

Currently, we serve our customers with our Grafana dashboard, but most of them don't know it exists. The dashboard also doesn't give them helpful metrics: the graphs need revision, and they are too technical to be meaningful without context.

A clean, user-centric approach is needed to help our devs understand the current state of the queues, tell whether they're operating nominally, and set their expectations for build & test infrastructure.

Motivation and Business Impact

This epic will help our customers become more productive by reducing the time they spend investigating pipelines.

The epic is complete when:

  • [ ] Our customers no longer have to wonder about the status of the CI pipeline, and are aware of long queues, infrastructure issues, etc.
  • [ ] Our customers can easily determine the health of Helix, with no knowledge of a hidden dashboard. A new dev should be able to determine this with no prior knowledge of the CI pipeline.
  • [ ] Our customers can understand the expected wait time for their builds and tests to complete.

Milestones

In progress.

One-Pager

In progress.

Recently Triaged Issues

All issues in this section should be triaged by the v-team into one of their business objectives or features.

dotnet-bot, Apr 07 '22 20:04