lizzie
lizzie copied to clipboard
Deadlock and Lizzie hangs
I've been stumbling on a Lizzie hangs for some time but have not bothered to look into it. Today I took a thread dump and the JVM had discovered a deadlock. I accidentally lost the the dump but I did go through the code based on the thread dump and the methods which had deadlocked where
Leelaz - public List<MoveData> getBestMoves() Leelaz - private void parseLine(String line)
They both take a lock on this object. Based on my quick analysis my proposal is actually to stop syncrhonising in the getBestMoves() as I could not figure out why should it be synchronised.
Here is my change https://github.com/toomasr/lizzie/commit/72a235c3c21a0b3484b4522a6ad514330d3c993d
I don't know for sure whether this is the correct solution but even if it is not then the developers will actually know that there is a verified deadlock in the program.
Can you reproduce the deadlock? If so, would you try #782? I moved refresh() in LizzieMain.java to the event dispatch thread (a6280681). This may fix the deadlock.
In my case...
- Start Lizzie.
- When an empty board is shown, immediately repeat hitting my "special key" for 3 seconds or so with about 2 hits/sec. This "special key" toggles the window size between "full screen" and "normal" in my customized window manager.
- deadlock with almost 100% probability on master (ebeb4469) with Panel UI
- no deadlock on master (ebeb4469) with normal UI
- no deadlock on #782 (c500a89d) with both Panel UI and normal UI
In the deadlock case, debug prints say as follows:
pool-2-thread-1 is stopped just before synchronized (target.getTreeLock())
in layoutContainer().
at featurecat.lizzie.gui.LizzieLayout.layoutContainer(LizzieLayout.java:343)
at featurecat.lizzie.gui.LizzieMain.invalidLayout(LizzieMain.java:500)
at featurecat.lizzie.gui.LizzieMain.updateStatus(LizzieMain.java:563)
at featurecat.lizzie.gui.LizzieMain.refresh(LizzieMain.java:525)
at featurecat.lizzie.gui.LizzieMain.refresh(LizzieMain.java:510)
at featurecat.lizzie.analysis.Leelaz.parseLine(Leelaz.java:357)
at featurecat.lizzie.analysis.Leelaz.read(Leelaz.java:523)
AWT-EventQueue-0 is stopped just before synchronized (this)
in getBestMoves().
at featurecat.lizzie.analysis.Leelaz.getBestMoves(Leelaz.java:794)
at featurecat.lizzie.gui.BoardRenderer.drawBranch(BoardRenderer.java:596)
at featurecat.lizzie.gui.BoardRenderer.draw(BoardRenderer.java:148)
at featurecat.lizzie.gui.BoardPane.paintComponent(BoardPane.java:190)
...
at featurecat.lizzie.gui.LizzieMain.paint(LizzieMain.java:310)
pool-1-thread-1 is stopped just before synchronized (this)
in getBestMoves().
at featurecat.lizzie.analysis.Leelaz.getBestMoves(Leelaz.java:794)
at featurecat.lizzie.gui.LizzieMain$5.run(LizzieMain.java:281)
First I rolled back my commit and tried to get the deadlock again. Actually it was not that difficult. I had to start the engine, load couple of SGF files and then start an auto analysis. I was able to take a thread dump this time. I put it here https://pastebin.com/XzHtZftL
I did try to reproduce the deadlock that you mention. As my window manager does not have a good toggle for that then I did not succeed. Which do you use? I have all the OSes and I can try.
I did discover that the initial startup without any configuration file ends in a disastrous end user experience with a NullPointerException on the featurecat/lizzie main branch. I'll see if my version has the same later and see if I can come up with a pull request. Of course I see that there are 37 unmerged pull requests so I don't know if I'll bother.
I again expect that #782 may fix your deadlock as it also takes care of GtpConsolePane.addText. Would you test it if possible?
The key point of my deadlock is refreshing of the screen just before the first response to lz-analyze
by the engine. So any resizing of the window at startup can cause the same deadlock and any successive resizing will increase its probability.
See #746 for NPE at the initial startup. You can try #798 to check whether a certain issue is already fixed.
I tried the https://github.com/kaorahi/lizzie/tree/for777_EDT and I was not able to reproduce the deadlock.
I experienced yet another deadlock in getBestMoves when I tried auto-analysis on master (ebeb4469) with KataGo. #782 seems to fix it as LizzieMain.refresh is moved into EDT.
"AWT-EventQueue-0":
at featurecat.lizzie.analysis.Leelaz.getBestMoves(Leelaz.java:767)
- waiting to lock <0x00000007474bb110> (a featurecat.lizzie.analysis.Leelaz)
at featurecat.lizzie.gui.BoardRenderer.drawBranch(BoardRenderer.java:592)
at featurecat.lizzie.gui.BoardRenderer.draw(BoardRenderer.java:147)
at featurecat.lizzie.gui.SubBoardPane.paintComponent(SubBoardPane.java:119)
at javax.swing.JComponent.paint([email protected]/JComponent.java:1074)
at javax.swing.JComponent.paintChildren([email protected]/JComponent.java:907)
- locked <0x0000000746d6b128> (a java.awt.Component$AWTTreeLock)
at javax.swing.JComponent.paint([email protected]/JComponent.java:1083)
...
"pool-3-thread-1":
at java.awt.Component.setFont([email protected]/Component.java:1933)
- waiting to lock <0x0000000746d6b128> (a java.awt.Component$AWTTreeLock)
at java.awt.Container.setFont([email protected]/Container.java:1777)
at javax.swing.JComponent.setFont([email protected]/JComponent.java:2769)
at featurecat.lizzie.gui.CommentPane.drawComment(CommentPane.java:125)
at featurecat.lizzie.gui.LizzieMain.updateStatus(LizzieMain.java:542)
at featurecat.lizzie.gui.LizzieMain.refresh(LizzieMain.java:514)
at featurecat.lizzie.gui.LizzieMain.refresh(LizzieMain.java:504)
at featurecat.lizzie.rules.Board.nextMove(Board.java:773)
- locked <0x00000007474d2b38> (a featurecat.lizzie.rules.Board)
at featurecat.lizzie.rules.Board.bestMoveNotification(Board.java:1425)
at featurecat.lizzie.analysis.Leelaz.notifyBestMoveListeners(Leelaz.java:885)
- locked <0x00000007474bb110> (a featurecat.lizzie.analysis.Leelaz)
at featurecat.lizzie.analysis.Leelaz.parseLine(Leelaz.java:373)
...
I see that #782 touches on 10 files and at the same time looking that getBestMoves only returns a list then maybe easier if we remove the syncing on that method and resolve a lot of deadlocks?
On the other hand if this little syncing let's you find places where invokeLater should be used then maybe it is a good canary :)
Yeah. Your fix will simply resolve all the above deadlocks (and more undiscovered ones) at once. :)
Have you tried toomasr@72a235c sufficiently long without any problem? If so, I'd like to merge it into my patched_0.7.4 (#798).
I analyse about 5-10 games a week. So far no side effects.
Overall one would sync if they are afraid there will be changes at the same time to the data structure but as it is a getter then I don't see this lock adding anything. The sync should happen when modifying the list. Sometimes you sync in a getter if there is some initialisation logic happening but again this is not the case. These are the reasons why I believe removing syncing from the getter is a safe move. Of course I don't know the code base and there might be surprises :)
I agree. It seems to protect nothing!
Hmm, this synchronized
exists from the very beginning in an early version of Lizzie (ff426b57). Anyway, I guess no one understand the entire code of Lizzie now and we have little hope to get a solid answer.
Leelaz.read()
is asynchronous and it isn't on Swing EDT, but Leelaz.parseLine(String)
calls many Swing methods.
To resolve that I wrapped parseLine
with SwingUtilities.invokeLater
. That fixed hang at resizing lizzie window.
(https://github.com/zakki/lizzie/commit/7e92d74ec3d53637273d7f9ece3929a00fb78df7)
So we have three patches now. :)
- (A) #782: move GUI methods into EDT. (merged to #798)
- (B) toomasr@72a235c: delete useless
synchronized
. (merged to #798) - (C) zakki@7e92d74: move GTP parser into EDT.
my opinion:
Even if any one of them is enough for the current issue, we may like to apply all the three patches in the hope of robustness against careless changes in future developments. But it will be better to test (C) for a while before merging it.
@zakki, has your zakki@7e92d74 been working without any trouble? I'd like to merge it into #798.
@kaorahi It works for me, but I don't use lizzie heavily (~ a hour per month). That branch contains experimental commit and revert commit like https://github.com/zakki/lizzie/commit/281db78b891074d9c761c79a3b830f2fdf26d430 , so please cherry-pick https://github.com/zakki/lizzie/commit/7e92d74 if you mind.
thx. Then it may be safer to cherry-pick zakki@7e92d74 after 0.7.5.
I cherry-picked zakki/lizzie@7e92d74 into #798. thx