rr icon indicating copy to clipboard operation
rr copied to clipboard

Replay gets stuck or diverges on a Firefox plugin-container for Widevine protected video

Open jld opened this issue 2 years ago • 6 comments

STR:

  1. Recording Firefox (I've tested on local builds of mozilla-central but an official build will probably work), go to https://shaka-player-demo.appspot.com/demo/, install the plugin if prompted, and play “Sintel” (should be marked as Widevine-protected); once it starts playing, the browser can be closed.
  2. Replay the last plugin-container process (should have gmplugin as the last arg and mention gmp-widevinecdm); when the gdb prompt appears, c. (If the sandbox isn't disabled (see below) and this isn't being run from a Firefox source tree, do handle SIGSYS nostop noprint first.)

In one case this gave a Replay diverged; target registers at ticks target mismatched error; in other cases, it got stuck with an rr:plugin-container process spinning on the CPU in userspace, with the main rr process blocked in waitid waiting for it to stop, which it never did.

I first encountered this on an AMD Threadripper PRO 3975WX but have also reproduced it on an Intel Core i9-9900K, so I don't think it's CPU-specific. The kernels are both Debian 5.17 versions. The sandbox can be disabled (MOZ_DISABLE_GMP_SANDBOX=1 in the environment) or not; it doesn't seem to make a difference.

At one point @khuey suggested uploading the trace to Pernosco, which was able to replay the whole thing. There was also an attempt to use rr rerun on the part of the trace where the replay got stuck, which resulted in this error:

[FATAL src/PerfCounters.cc:733:read_ticks()] 
 (task 225329 (rec:214287) at time 3524)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 562916 ticks, expected no more than 550033

And the divergence error I got on that one replay attempt that didn't just silently get stuck:

(rr) c
Continuing.
[ERROR src/ReplaySession.cc:775:guard_overshoot()] Replay diverged; target registers at ticks target mismatched: 
[FATAL src/ReplaySession.cc:784:guard_overshoot()] 
 (task 323030 (rec:320100) at time 489978)
 -> Assertion `false' failed to hold. overshot target ticks=105440383 by 2

(Also, just for the record: I am specifically not trying to inspect the DRM module itself; I was debugging an issue in the Gecko code running in that process.)

jld avatar Jun 03 '22 01:06 jld

Does replay -a work? If so, it's possible that GDB's breakpoint insertion is triggering a divergence. We've seen that kind of thing before.

Keno avatar Jun 03 '22 01:06 Keno

Note that it doesn't have to be gdb, even with rr replay -a we'll use breakpoints internally and that has triggered divergence in the past (e.g. #3030). That it works in Pernosco is a sign something really awful is going on.

khuey avatar Jun 03 '22 02:06 khuey

And a DRM module is the sort of thing that would read its own code back to see if it's changed :(

khuey avatar Jun 03 '22 02:06 khuey

Perhaps we could add a command line switch where all breakpoints are automatically promoted to hardware breakpoints? That's a limited resource of course, but we could in theory replay the same time range twice with different set of hardware breakpoints to figure out if any of them fired.

Keno avatar Jun 03 '22 03:06 Keno

Yeah that was what I came up with in https://github.com/rr-debugger/rr/issues/3030#issuecomment-1013581049. Perhaps we should just do it.

khuey avatar Jun 03 '22 03:06 khuey

Sounds like this change would actually provide a workaround for #3030 (which is closed, not solved) without recompilation of "code under replay" (or even worse like in this case: changes to that code).

GitMensch avatar Jun 24 '22 12:06 GitMensch