P1 - Success rate metric only calculated based on NoOrchestrator transcode errors
**Describe the bug**
The success rate Grafana graph here shows success rate consistently >= 100% even though we know that there have been transcode failures.
From reviewing the metrics code, it appears that the success rate metric is updated whenever there is a call to census.sendSuccess() in monitor/census.go.
I see this method being called in at least two places:
- In SegmentFullyTranscoded(), which is called here after B finishes downloading all results from O
- In segmentTranscodeFailed(), which is called in SegmentTranscodeFailed(), which is called whenever a transcode error is encountered
For the second case, there is a concept of a "permanent" vs. "non-permanent" transcode error, indicated via the permanent bool passed to SegmentTranscodeFailed(). We can see non-permanent errors being recorded here. The only place where a permanent error is recorded is here, for NoOrchestrator transcode errors. This seems problematic because only permanent errors will trigger a call to census.sendSuccess() here when recording a transcode error. As a result, I don't think we are properly updating the success rate metric in at least the following places:
- When we hit the max # of transcode attempts which prompts B to give up on a segment here
- When we hit a non-retryable transcode error which prompts B to give up on a segment here
- When the caller (i.e. HTTP push client) gives up on the transcode resulting in a context cancellation here
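The flaw described above can be illustrated with a simplified sketch (illustrative names only, not the actual monitor/census.go code): because only permanent failures reach sendSuccess(), non-permanent failures never enter the success-rate denominator.

```go
package main

import "fmt"

// Simplified model of the success-rate bookkeeping described above.
// Names are illustrative; the real code lives in monitor/census.go.
type census struct {
	successes, attempts int
}

func (c *census) sendSuccess(ok bool) {
	c.attempts++
	if ok {
		c.successes++
	}
}

// segmentTranscodeFailed only records a failure when the error is
// marked permanent. Non-permanent failures never reach sendSuccess,
// which mirrors the bug: segments B eventually gives up on are
// dropped from the denominator.
func (c *census) segmentTranscodeFailed(permanent bool) {
	if permanent {
		c.sendSuccess(false)
	}
}

func (c *census) successRate() float64 {
	if c.attempts == 0 {
		return 1.0
	}
	return float64(c.successes) / float64(c.attempts)
}

func main() {
	c := &census{}
	c.sendSuccess(true)             // one successful transcode
	c.segmentTranscodeFailed(false) // non-permanent failure: not counted
	c.segmentTranscodeFailed(false) // max-attempts give-up: not counted
	fmt.Println(c.successRate())    // prints 1 -- the two failed segments never counted
}
```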
**To Reproduce**
We could trigger the aforementioned errors that are not currently being factored in and check whether the success rate changes. A solution should demonstrate that these errors cause the success rate to drop.
**Expected behavior**
I expect the success rate metric to properly factor in all transcode errors that result in no renditions for a segment that is passed in.
Generally, I see at least these categories of transcode errors that should cause success rate to drop:
- If B hits the max # of transcode attempts - B should give up on the segment b/c it tried enough times already
- If B hits a non-retryable error - B should give up on the segment b/c it knows that this segment likely just cannot be transcoded
- If B knows that the caller (i.e. HTTP push client) is no longer waiting for a result - B should give up on the segment b/c it knows no one cares about the results anymore because the transcode was too slow
**Additional context**
@mjh1 promise this is the last one, but it's quite a good one for you to take since it involves understanding some of the Broadcaster logs and how we record metrics
Success rate metrics. Why? Improves our visibility and lets us compare better with Evan's graph. Difficult to investigate right now.
@yondonfu do you think rather than making sure we emit the metric for every failure scenario we could introduce a metric for "transcode started" (if it doesn't exist) then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios?
e.g. https://github.com/livepeer/go-livepeer/pull/2684/commits/abd13349d6c5791114a819e0f42889276ee227bf
but then I see in some cases the SegmentTranscoded metric is fired from the different transcoder implementations so maybe we would need to fire the transcode start metric there too.
> do you think rather than making sure we emit the metric for every failure scenario we could introduce a metric for "transcode started" (if it doesn't exist) then use transcode_success/transcode_start for our success rate? That way we wouldn't have to worry about potentially missing any obscure failure scenarios?
@mjh1 Does that mean that if B starts multiple transcodes for segments concurrently, during the period of time until the transcodes complete we could see that success rate metric drop since transcode_start would be incremented, but B is still waiting for the transcodes to complete? That's the main concern that comes to mind and reason for updating the success metric only when a transcode is complete (either with a success or a legit failure).
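The concern above can be shown with a tiny sketch (the rate helper is illustrative, not an existing metric): with a start/success counter pair, in-flight segments deflate the ratio before they have had a chance to complete.

```go
package main

import "fmt"

// rate models the proposed transcode_success / transcode_start ratio.
func rate(success, started int) float64 {
	if started == 0 {
		return 1.0
	}
	return float64(success) / float64(started)
}

func main() {
	// 10 segments started, 8 already succeeded, 2 still transcoding:
	// the metric reads 0.8 even though nothing has actually failed yet.
	fmt.Println(rate(8, 10)) // prints 0.8
}
```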
ah yep you're totally right, will forget that idea
Reopening until the change is deployed.