Gather publishing metrics
We should gather publishing metrics for each publish so that we can track some key items that may affect performance (a rough sketch of what this could capture follows the list):
- Number of files
- How big is the input file set
- How much data is actually sent (e.g. we may download all of PackageArtifacts and BlobArtifacts, but then publish all of PackageArtifacts to two locations and all of BlobArtifacts to one)
- How long nuget pushes take
- How long file pushes take
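To make the list above concrete, here is a rough sketch of the kind of data this would mean capturing per publish. None of this exists in Arcade today; the type and property names are invented purely for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for the metrics listed above; not an existing Arcade type.
public class PublishingMetrics
{
    // Number of files in the input set (packages + blobs).
    public int InputFileCount { get; set; }

    // Total size of the downloaded input file set, in bytes.
    public long InputBytes { get; set; }

    // Bytes actually uploaded, which can exceed InputBytes when the same
    // artifacts are published to multiple locations (e.g. two package feeds).
    public long UploadedBytes { get; set; }

    // Wall-clock time spent in nuget pushes and in blob/file pushes.
    public TimeSpan NuGetPushDuration { get; set; }
    public TimeSpan FilePushDuration { get; set; }

    // Per-target breakdown, e.g. "feed:dotnet-eng" -> bytes pushed.
    public Dictionary<string, long> BytesPerTarget { get; } = new Dictionary<string, long>();
}
```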
@Chrisboh @chcosta I think the timeline data would be great to use for this, as we could plot it over time. @adiaaida said it looks like right now we're only tracking warnings and errors, though, and we wouldn't want to use those for this. Is there another logging option we could use to get this into the Kusto DB?
Gentle ping. This looks quite valuable to have. Which epic would this fit best under?
This feels like a good candidate for the epic @jcagme is driving.
@markwilkie your thoughts?
I'm not sure it fits in our epic, because it would be tracking things that happen in regular builds, rather than things that happen during release.
I would say it belonged in the PKPI epic, but we've all moved off of it.
Yeah, the things that happen during release, which include publishing to nuget.org and official feeds, are already instrumented, so we can measure how long each task took. I think the goal of this epic is to measure build-to-build publishing.
Is there already data on how long each "step" takes in each ring? Assuming so, is that sufficient, @mmitche?
This should already exist for all builds in the 1ES Data Kusto tables. That's how we get data for the release pipeline.
My understanding of this issue is that @mmitche created it when we were trying to figure out why official builds for various repos were taking so long, back when we were putting the data together for the town hall. Matt suggested it would be useful to collect data during the slow-running legs, for example using @chcosta's PipelineLogger functionality, but logging the metrics as info or something similar instead of warning or error (the only two options today). Once that logging was in place, we would use the normal methods we already have to push it into Kusto, where we could query and analyze it to figure out the best things to tackle when trying to drive the publishing time down. Right now the data he's interested in isn't logged in any way, and we don't quite have a mechanism to log it (though we could just log it as warnings and use that).
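For reference, the "just log it as warnings" workaround mentioned above would boil down to emitting the standard Azure Pipelines `##vso[task.logissue]` logging command from the publishing task. The command itself (and the fact that it only accepts `warning`/`error`) is real; the message format and `PUBLISH-METRIC` prefix are an invented convention so the values could be filtered out later in Kusto.

```csharp
using System;

// Emits a timeline-visible "issue" from a task. Azure Pipelines only honors
// type=warning or type=error here, so metrics logged this way show up as
// warnings; the PUBLISH-METRIC prefix is a made-up convention for filtering.
static void LogMetricAsWarning(string name, string value)
{
    Console.WriteLine($"##vso[task.logissue type=warning]PUBLISH-METRIC {name}={value}");
}

// Example: LogMetricAsWarning("NuGetPushSeconds", "412");
```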
Yep, that's right.
I see. So the work is to perhaps add an "info" type of logging?
"Info" type logging isn't really supported by the timeline. The documentation kind of implies that it is, but it doesn't work. I've talked with AzDo quite a bit about it and was unable to inject any custom info message into the timeline. Additionally, when we previously went down the route of adding app insights type logging into the builds, we experienced significant build slowness due to network issues even though it was supposed to be a fire and forget type logging with minimal impact.
We probably already have "task"-level metrics; it's just a question of figuring out how to interpret them in a way that is meaningful.
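If the existing timeline data in the 1ES Kusto tables is enough, "interpreting it" could be as simple as a recurring query over the publish-related timeline records. This is only a sketch: the cluster, database, table, and column names below are placeholders, not the real 1ES schema; only the general Kusto.Data client pattern is standard.

```csharp
using System;
using Kusto.Data;
using Kusto.Data.Common;
using Kusto.Data.Net.Client;

class PublishDurationReport
{
    static void Main()
    {
        // Placeholder cluster/database; the real 1ES cluster and schema would need to be substituted.
        var kcsb = new KustoConnectionStringBuilder("https://<cluster>.kusto.windows.net", "<database>")
            .WithAadUserPromptAuthentication();

        // Hypothetical table/columns: a timeline-records table with the record name,
        // start time, and elapsed duration per build step.
        const string query = @"
TimelineRecords
| where RecordName startswith 'Publish'
| summarize AvgMinutes = avg(DurationMinutes) by RecordName, Day = bin(StartTime, 1d)
| order by Day asc";

        using var provider = KustoClientFactory.CreateCslQueryProvider(kcsb);
        using var reader = provider.ExecuteQuery("<database>", query, new ClientRequestProperties());
        while (reader.Read())
        {
            Console.WriteLine($"{reader["Day"]}  {reader["RecordName"]}  {reader["AvgMinutes"]}");
        }
    }
}
```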
@mmitche What do you think should happen here?
Maybe the right thing to do is to simply print out info about the publishing at the end (see the metrics in the top-level issue description; a rough sketch follows). When we see regressions in the timeline data, we can either mine or inspect those metrics. It's not super elegant, but it would work and is probably fine for our purposes.
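To make that suggestion concrete, here is a minimal sketch of the end-of-publish printout, assuming a metrics object like the `PublishingMetrics` sketch under the issue description is populated while the task runs (again, these names are illustrative, not existing Arcade APIs).

```csharp
using System;

// Hypothetical end-of-publish summary; values come from counters/stopwatches
// maintained while pushing packages and blobs. Written to the console so it
// lands in the build log next to the timeline record for the publish step.
static void PrintPublishingSummary(PublishingMetrics m)
{
    Console.WriteLine("=== Publishing metrics ===");
    Console.WriteLine($"Input files:      {m.InputFileCount}");
    Console.WriteLine($"Input size:       {m.InputBytes / (1024.0 * 1024.0):F1} MiB");
    Console.WriteLine($"Bytes uploaded:   {m.UploadedBytes / (1024.0 * 1024.0):F1} MiB");
    Console.WriteLine($"NuGet push time:  {m.NuGetPushDuration}");
    Console.WriteLine($"File push time:   {m.FilePushDuration}");
    foreach (var kvp in m.BytesPerTarget)
        Console.WriteLine($"  {kvp.Key}: {kvp.Value / (1024.0 * 1024.0):F1} MiB");
}
```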