rr
rr copied to clipboard
Replay gets stuck or diverges on a Firefox plugin-container for Widevine protected video
STR:
- Recording Firefox (I've tested on local builds of mozilla-central but an official build will probably work), go to https://shaka-player-demo.appspot.com/demo/, install the plugin if prompted, and play “Sintel” (should be marked as Widevine-protected); once it starts playing, the browser can be closed.
- Replay the last plugin-container process (should have
gmplugin
as the last arg and mentiongmp-widevinecdm
); when the gdb prompt appears,c
. (If the sandbox isn't disabled (see below) and this isn't being run from a Firefox source tree, dohandle SIGSYS nostop noprint
first.)
In one case this gave a Replay diverged; target registers at ticks target mismatched
error; in other cases, it got stuck with an rr:plugin-container
process spinning on the CPU in userspace, with the main rr
process blocked in waitid
waiting for it to stop, which it never did.
I first encountered this on an AMD Threadripper PRO 3975WX but have also reproduced it on an Intel Core i9-9900K, so I don't think it's CPU-specific. The kernels are both Debian 5.17 versions. The sandbox can be disabled (MOZ_DISABLE_GMP_SANDBOX=1
in the environment) or not; it doesn't seem to make a difference.
At one point @khuey suggested uploading the trace to Pernosco, which was able to replay the whole thing. There was also an attempt to use rr rerun
on the part of the trace where the replay got stuck, which resulted in this error:
[FATAL src/PerfCounters.cc:733:read_ticks()]
(task 225329 (rec:214287) at time 3524)
-> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 562916 ticks, expected no more than 550033
And the divergence error I got on that one replay attempt that didn't just silently get stuck:
(rr) c
Continuing.
[ERROR src/ReplaySession.cc:775:guard_overshoot()] Replay diverged; target registers at ticks target mismatched:
[FATAL src/ReplaySession.cc:784:guard_overshoot()]
(task 323030 (rec:320100) at time 489978)
-> Assertion `false' failed to hold. overshot target ticks=105440383 by 2
(Also, just for the record: I am specifically not trying to inspect the DRM module itself; I was debugging an issue in the Gecko code running in that process.)
Does replay -a work? If so, it's possible that GDB's breakpoint insertion is triggering a divergence. We've seen that kind of thing before.
Note that it doesn't have to be gdb, even with rr replay -a
we'll use breakpoints internally and that has triggered divergence in the past (e.g. #3030). That it works in Pernosco is a sign something really awful is going on.
And a DRM module is the sort of thing that would read its own code back to see if it's changed :(
Perhaps we could add a command line switch where all breakpoints are automatically promoted to hardware breakpoints? That's a limited resource of course, but we could in theory replay the same time range twice with different set of hardware breakpoints to figure out if any of them fired.
Yeah that was what I came up with in https://github.com/rr-debugger/rr/issues/3030#issuecomment-1013581049. Perhaps we should just do it.
Sounds like this change would actually provide a workaround for #3030 (which is closed, not solved) without recompilation of "code under replay" (or even worse like in this case: changes to that code).