
NVDA occasionally causes Firefox to stop responding due to deadlock with nvdaControllerInternal_vbufChangeNotify

Open jcsteh opened this issue 2 years ago • 4 comments

This happens extremely rarely, but I thought it worth capturing the info because the problem is pretty nasty.

Steps to reproduce:

I've most often seen this in Slack:

  1. Press control+k to open the quick switcher.
  2. Choose a conversation.
  3. Press enter to switch to it.

Actual behavior:

Firefox hangs and stops responding until you quit NVDA. NVDA remains responsive and usable except in Firefox.

Expected behavior:

No hang.

System configuration

NVDA installed/portable/running from source:

Source.

NVDA version:

Commit 99588410

Windows version:

Windows 11 Version 21H2 (OS Build 22000.1696)

Name and version of other software in use when reproducing the issue:

Firefox Nightly 113.0a1 (2023-03-28) (64-bit)

Analysis

Python stack from the relevant thread in NVDA:

```
Python stack for thread 9500 (Dummy-107):
  File "C:\Users\jamie\src\nvda\source\NVDAHelper.py", line 452, in nvdaControllerInternal_vbufChangeNotify
    virtualBuffers.VirtualBuffer.changeNotify(rootDocHandle, rootID)
  File "C:\Users\jamie\src\nvda\source\virtualBuffers\__init__.py", line 729, in changeNotify
    queueHandler.queueFunction(queueHandler.eventQueue, cls.rootIdentifiers[rootDocHandle, rootID]._handleUpdate)
  File "C:\Users\jamie\src\nvda\source\queueHandler.py", line 51, in queueFunction
    queue.put_nowait((func,args,kwargs))
  File "C:\Users\jamie\src\nvda\source\monkeyPatches\comtypesMonkeyPatches.py", line 130, in newCpbDel
    self._oldCpbDel()
  File "C:\Users\jamie\src\nvda\.venv\lib\site-packages\comtypes\__init__.py", line 957, in __del__
    self.Release()
  File "C:\Users\jamie\src\nvda\.venv\lib\site-packages\comtypes\__init__.py", line 1212, in Release
    return self.__com_Release()
  File "C:\Users\jamie\src\nvda\source\monkeyPatches\comtypesMonkeyPatches.py", line 32, in __call__
    return super().__call__(*args,**kwargs)
```
  1. In Firefox's main thread, nvdaHelperRemote calls nvdaControllerInternal_vbufChangeNotify.
  2. A background RPC thread in NVDA serves the request.
  3. That RPC thread queues the notification to the buffer in the main thread.
  4. While it's trying to put the entry into the queue, Python garbage collection occurs.
  5. It calls Release() on a COM object from Firefox.
  6. I don't have stacks from Firefox, but my guess is that this triggered a cross-process call to Firefox's main thread to release the COM object, since this is our last reference.
  7. Firefox's main thread is blocked in the sync call to nvdaControllerInternal_vbufChangeNotify (1). However, nvdaControllerInternal_vbufChangeNotify is blocked waiting for Firefox's main thread to return from Release (6). Deadlock!
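The mechanism behind steps 4 to 6 is that a Python finalizer runs on whatever thread happens to trigger a collection. This is a minimal, hypothetical sketch of that behaviour (the `ComProxy` class and thread names are illustrative, not NVDA code):

```python
import gc
import threading

# Disable automatic collections so the demo collects deterministically;
# an explicit gc.collect() still works while the gc is disabled.
gc.disable()

finalized_on = []

class ComProxy:
    """Stand-in for a comtypes COM pointer: __del__ runs arbitrary code."""
    def __del__(self):
        # In the real bug this is where IUnknown::Release is called, which
        # may marshal a blocking cross-process call back to Firefox.
        finalized_on.append(threading.current_thread().name)

def make_cycle():
    proxy = ComProxy()
    holder = {"proxy": proxy}
    holder["self"] = holder  # reference cycle: only the cyclic gc can free it

make_cycle()

def rpc_thread():
    # On this thread, a collection frees the cycle, dropping the last
    # reference to the proxy, so its finalizer runs HERE, not on the
    # main thread.
    gc.collect()

t = threading.Thread(target=rpc_thread, name="rpc-worker")
t.start()
t.join()
gc.enable()
print(finalized_on)  # the finalizer ran on the RPC thread
```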

Ideally, we don't want main thread COM objects getting released on background threads at all. However, it's particularly egregious in this case because the release can call back into a thread which is blocked waiting for this RPC call to complete.
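One way to enforce that rule would be to intercept releases made on background threads and defer them to the main thread. This is only a sketch under assumed names (`safe_release` and `drain_pending_releases` are hypothetical helpers, not NVDA's actual comtypes monkeypatch):

```python
import queue
import threading

# Queue of COM objects awaiting release; assumed to be drained
# periodically by the main thread's event pump.
_pending_releases: "queue.Queue" = queue.Queue()
_main_thread = threading.main_thread()

def safe_release(com_object) -> None:
    """Release immediately on the main thread; defer from any other thread."""
    if threading.current_thread() is _main_thread:
        com_object.Release()
    else:
        # A background (e.g. RPC) thread must not make a potentially
        # blocking cross-process Release call: hand it to the main thread.
        _pending_releases.put(com_object)

def drain_pending_releases() -> None:
    """Called from the main thread to perform deferred releases."""
    while True:
        try:
            obj = _pending_releases.get_nowait()
        except queue.Empty:
            break
        obj.Release()
```

This trades promptness of release for safety: the COM object lives slightly longer, but the cross-process `Release` can never originate from an RPC thread.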

Possible avenues to explore:

  1. Why is this object being garbage collected at all? I guess we must have a cycle somewhere?
  2. Can we prevent garbage collection while handling RPC calls? We could temporarily disable the gc at the start of an RPC call, but when do we re-enable it? We can't re-enable it before the call returns, because a collection could then run and land us in the same bug. Maybe we could queue the re-enable to the main thread? I also worry that a collection might run before our disable call does; there's nothing we can do about that.
  3. In nvdaHelperRemote, can we queue the nvdaControllerInternal_vbufChangeNotify call to a background thread? This is a bit ugly, but avoids the problem of the main thread being blocked by this.

Note that this problem can probably happen with any other outgoing RPC call from the main thread. I'm particularly thinking about nvdaControllerInternal_reportLiveRegion.

jcsteh avatar Mar 29 '23 23:03 jcsteh

Yes, some of the later versions of Waterfox hang like this, or very similarly. I went back to an old classic release to fix it. The trouble now is that it's noticeably slower at processing JavaScript and is often rejected as an unsupported browser by some sites. It's annoying, but these days one needs several browsers to be able to cope. Brian

Brian1Gaff avatar Mar 30 '23 08:03 Brian1Gaff

Temporarily disabling the gc, etc. has been tried before and it won't work reliably. A collection can happen at any time.
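There is a second reason the gc-disabling approach can't be fully reliable on CPython: `gc.disable()` only pauses the cyclic collector, while reference counting still runs `__del__` the instant a count hits zero:

```python
import gc

ran = []

class WithFinalizer:
    def __del__(self):
        ran.append("finalized")

gc.disable()
try:
    obj = WithFinalizer()
    # Refcount drops to zero: on CPython, __del__ runs immediately,
    # even though the cyclic garbage collector is disabled.
    del obj
finally:
    gc.enable()

print(ran)
```

So even a perfectly timed `gc.disable()` would only suppress finalizers reached through reference cycles, not ordinary refcount-driven releases.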

Similarly, we perhaps should try to fix whatever cycle is causing this, but ultimately, cycles are always a possibility and there's always a risk this can happen. Given how nasty this is, we shouldn't permit that risk.

The only realistic solution that leaves is queuing this RPC call to a background thread. We already do this for nvdaHelperRemote logging. Note that there's no requirement for this call to be synchronous. We already queue it to the main thread and return without waiting for the main thread to complete the call.

jcsteh avatar Apr 02 '23 06:04 jcsteh

I can't seem to reproduce this in Firefox and Slack. Do you have any other suggested STR? Also I'm guessing the solution suggested in https://github.com/nvaccess/nvda/issues/14764#issuecomment-1493236922 has evolved slightly given comments in https://github.com/nvaccess/nvda/issues/6463#issuecomment-2048872707

seanbudd avatar Apr 26 '24 02:04 seanbudd

> I can't seem to reproduce this in Firefox and Slack. Do you have any other suggested STR?

Unfortunately not. It's one of those issues that doesn't occur that often, but when it does, it's very nasty.

> Also I'm guessing the solution suggested in #14764 (comment) has evolved slightly given comments in #6463 (comment)

The solution in https://github.com/nvaccess/nvda/issues/6463#issuecomment-2048872707 potentially covers more problems, but I'd say there's more unknowns and it's riskier and harder to implement.

jcsteh avatar Apr 26 '24 02:04 jcsteh