linux icon indicating copy to clipboard operation
linux copied to clipboard

drm/vc4: Fix gpu reset

Open carlonluca opened this issue 1 year ago • 3 comments

This commit changes the bind procedure by avoiding the call to pm_runtime_resume_and_get. Since commit #71be60f, the reset of the GPU caused incorrect rendering and crashes. With this patch the GPU reset procedure properly resets the GPU, the rendering resumes properly and no crash occurs.

carlonluca avatar Jun 25 '24 16:06 carlonluca

The rpi-6.1.y branch is now the legacy branch and isn't taking new downstream features/fixes, only fixes from upstream.

vc4_v3d.c is also almost unchanged from mainline, and those minor diffs should be merged upstream imminently. Fixes should be submitted to dri-devel for review.

The patch also feels dubious as you now have a call to clk_prepare_enable with no clk_disable_unprepare, and no call to power the domain before accessing the registers. That may mask a problem in the reset path that does something wrong in recovery, but it's only by messing up the accounting in pm_runtime and clks. The original patch commit text describes the issue on unbind and bind afresh that it was fixing. Have you tested that scenario with your fix?

Minor comment that 71be60f is a hash from the rpi-5.15.y branch. It was merged into mainline as https://github.com/torvalds/linux/commit/266cff37d7fcec0443e720878242cb52d728138b in the 6.1 merge window.

6by9 avatar Jun 25 '24 17:06 6by9

Just reenforcing @6by9 comment, I can't see how this patch could fix any GPU reset issue. By not using pm_runtime_resume_and_get(), we ignore the device's usage counter, which means that its value won't be incremented, and this will compromise the whole PM mechanism. Probably, this solution is just hiding a defect in the GPU reset mechanism.

About moving V3D_WRITE(V3D_BPOA, 0); and V3D_WRITE(V3D_BPOS, 0);, I cannot see how this could influence the GPU reset. The out-of-memory interrupt won't be enabled on vc4_irq_enable().

Check the comment on vc4_irq_enable():

	/* Enable the render done interrupts. The out-of-memory interrupt is
	 * enabled as soon as we have a binner BO allocated.
	 */

It is hard for me to see what this patch is trying to achieve. I believe it might fix your problem because the device's usage counter was compromised and it wasn't incremented.

mairacanal avatar Jul 06 '24 19:07 mairacanal

Thank you for the detailed feedback, @6by9 and @mairacanal.

We understand the concerns about avoiding pm_runtime_resume_and_get() and its impact on the device's usage counter. The intention behind this change was to resolve a specific GPU reset issue that resulted in rendering failures and crashes, particularly after commit 71be60f. However, we acknowledge that bypassing the PM mechanism might not be a long-term solution and could mask deeper issues related to the reset path and power management.

Regarding the movement of V3D_WRITE(V3D_BPOA, 0) and V3D_WRITE(V3D_BPOS, 0), we will revisit this as well since it was not intended to affect the out-of-memory interrupt behavior but to address rendering inconsistencies noticed during reset.

We'll take your advice to perform additional testing, especially in unbind/bind scenarios, and also explore a more robust fix that properly handles the usage counter and clock management without compromising PM. We'll consider submitting the fix to dri-devel for further review and work on aligning the approach with upstream developments.

Thanks again for pointing this out.

kiruthikpurpose avatar Oct 10 '24 04:10 kiruthikpurpose