Unsynced DAO state forcing to restart Bisq in order to fix sync

Open w0000000t opened this issue 2 years ago • 6 comments

In today's weekly support call, the issue of "untakeable BSQ swap offers" reported by @jmacxx was discussed further. He explained that there is a limit on the amount of DAO data that peers are "allowed" to download, and that swaps significantly increase data traffic, which often leads to a peer not downloading all the available data and therefore falling out of consensus. This, in turn, prevents a swap maker from accepting takers, or a swap taker from being accepted by makers. The solution would be to resync the DAO state, and what was previously an "annoying popup" that got disabled is being reinstated for this purpose, with the risk of resuming the flow of support requests about that popup.

My understanding of this issue is that restarting the application allows Bisq to "resume" downloading the data missing from the previous sync attempt, thus restoring the sync state; when a significant amount of data is missing, the popup will prompt several consecutive restarts until sync is finally achieved. Why does the application need to be restarted for this? Would it not be possible for Bisq to limit the amount of data downloaded in one go, and then resume downloading the next batch of allowed data at a later moment (after N minutes, for example) until sync is complete, transparently to the user and, above all, without requiring a restart?
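To make the question concrete, here is a minimal sketch of the batching idea, assuming the client knows its last downloaded height, the chain tip, and a per-request limit. None of the class or method names below exist in Bisq; they are purely illustrative. The heights 719240 and the limit of 6000 come from a later comment in this thread, while the tip 727500 is a made-up value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Bisq.
final class BatchedDaoSync {
    /** Inclusive range of block heights requested in one go. */
    static final class Batch {
        final int fromHeight;
        final int toHeight;

        Batch(int fromHeight, int toHeight) {
            this.fromHeight = fromHeight;
            this.toHeight = toHeight;
        }

        @Override
        public String toString() {
            return fromHeight + ".." + toHeight;
        }
    }

    private final int targetHeight; // assumed chain tip known to the client
    private final int batchSize;    // assumed per-request download limit
    private int syncedHeight;       // last height already downloaded

    BatchedDaoSync(int syncedHeight, int targetHeight, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.syncedHeight = syncedHeight;
        this.targetHeight = targetHeight;
        this.batchSize = batchSize;
    }

    boolean isSynced() {
        return syncedHeight >= targetHeight;
    }

    /** Returns the next batch and advances the cursor, or null once synced. */
    Batch nextBatch() {
        if (isSynced()) {
            return null;
        }
        int to = Math.min(syncedHeight + batchSize, targetHeight);
        Batch batch = new Batch(syncedHeight + 1, to);
        syncedHeight = to;
        return batch;
    }

    /** Lists every batch needed; a real client would pause between requests. */
    static List<Batch> planAll(int syncedHeight, int targetHeight, int batchSize) {
        BatchedDaoSync sync = new BatchedDaoSync(syncedHeight, targetHeight, batchSize);
        List<Batch> batches = new ArrayList<>();
        for (Batch b = sync.nextBatch(); b != null; b = sync.nextBatch()) {
            batches.add(b);
        }
        return batches;
    }

    public static void main(String[] args) {
        // 719240 and 6000 appear in this thread; 727500 is a made-up chain tip.
        System.out.println(planAll(719240, 727500, 6000));
    }
}
```

In a real client, `nextBatch()` would be driven by a timer or by the arrival of the previous response rather than by a tight loop, so the download limit is respected without the user ever seeing a restart prompt.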

Alternatively, if the above is not technically feasible, it would be a nice addition to make the "annoying unsynced DAO popup" manageable by the user. For example, it could briefly explain what happened, that it might affect the user's ability to participate in swaps (in which case Bisq needs to be restarted until sync is successful), and that the error can be ignored by users who are not interested in swaps; additionally, it could have a "do not show this again" checkbox.

Feb 23 '22 20:02 w0000000t

The significant data increase of account age witness records was a bug, fixed by https://github.com/bisq-network/bisq/pull/5974. So that may have settled down by now, and it could be that we're back to the previous "normal" rate of consensus errors.

The reminder popup was coded to show at most once per Bisq session; a "do not show again" option would defeat its purpose, which is to get all nodes into consensus. I agree, though, that the wording could be made nicer to explain cause and effect, as was done in a related way here: https://github.com/bisq-network/bisq/pull/6063

The other questions @chimp1984 is best suited to answer.

Feb 23 '22 21:02 jmacxx

Agree with @jmacxx. The data limitation issues should be fixed by now. The logs would show whether that is still an issue, but I highly doubt it. I think there is a bug in the snapshot handling and/or persistence for DAO data. By separating DAO blocks from DaoState we introduced more risk that DAO data gets out of sync (before, it was all in the DaoState, which carried lower risk but became a scalability problem).

The bug is not trivial to find. I would suggest adding a lot of logging to that code area; hopefully it will help reveal where the issue is caused. A very critical code review of all those code paths might help as well. I am offline for the next few days...

Feb 23 '22 21:02 chimp1984

[edit - removed most of this status report because some of the results turned out to be caused accidentally, as a side effect of some diagnostic code I had inserted in an attempt to observe some statuses.]

~~"DAO state chain not connecting with the new data"~~ (turned out to be a false alarm)

Other observations:

If you compare a known-good hash chain against those of peers, going further back, you see a variable pattern of runs of several hundred matching, then non-matching, then matching hashes. The non-matching runs are always ones where the peer generated the hash itself, and the matching runs are ones where the hash came from a seednode.
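The comparison described above can be sketched roughly as follows. This is an illustration only, not the diagnostic code I actually used: the class name is made up, and hashes are represented as hex strings for brevity, whereas the real DAO state hashes are byte arrays. It walks two hash chains aligned by position and groups consecutive entries into matching and non-matching runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups a position-aligned comparison of two hash
// chains into runs of matching / non-matching entries.
final class HashChainDiff {
    static final class Run {
        final boolean matching;
        final int startIndex;
        final int length;

        Run(boolean matching, int startIndex, int length) {
            this.matching = matching;
            this.startIndex = startIndex;
            this.length = length;
        }

        @Override
        public String toString() {
            return (matching ? "match" : "MISMATCH") + "@" + startIndex + " x" + length;
        }
    }

    static List<Run> runs(List<String> reference, List<String> peer) {
        // Only the overlapping prefix of the two chains is compared.
        int n = Math.min(reference.size(), peer.size());
        List<Run> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= n; i++) {
            boolean prev = reference.get(i - 1).equals(peer.get(i - 1));
            // Close the current run at the end or when the match status flips.
            if (i == n || reference.get(i).equals(peer.get(i)) != prev) {
                out.add(new Run(prev, start, i - start));
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("a1", "b2", "c3", "d4", "e5");
        List<String> peer = Arrays.asList("a1", "b2", "xx", "yy", "e5");
        System.out.println(runs(good, peer));
    }
}
```

Tagging each entry with its origin (self-generated or received from a seednode) alongside the run boundaries is what makes the pattern visible.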

Still investigating.

Mar 04 '22 20:03 jmacxx

One reproducible error came to light from a user experiencing an issue in the support channel. Installing Bisq as a new user about 6 weeks after the latest release will consistently produce a DAO state which is out of consensus. Steps to reproduce:

1. Delete the local/share/Bisq directory (i.e. starting with a completely empty data directory).
2. Run the Bisq release from the previous month (at the time of writing, v1.8.2).
3. 6000 BsqBlocks are received from the seednode, advancing the height from 719240 to 725240.
4. No subsequent request for BsqBlocks is made, and the chain remains un-synced indefinitely.

This is due to L237 in LiteNode.java which does not request subsequent blocks if the BitcoinJ chain is still downloading.
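One way to avoid dropping the follow-up request would be to remember it and send it once the wallet chain has finished downloading. The sketch below only illustrates that idea and is not the actual LiteNode code: the class, its methods, and the way requests are recorded are all assumptions made for the example, and 727500 in the usage is a made-up chain tip.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: defers the follow-up block request instead of
// dropping it while the wallet chain is still downloading.
final class DeferredBlockRequester {
    private final List<Integer> sentRequests = new ArrayList<>();
    private boolean walletChainSynced;
    private Integer deferredFromHeight; // null when nothing is pending

    /** Called after a batch of blocks has been received and processed. */
    void onBatchReceived(int lastHeightInBatch, int chainTipHeight) {
        if (lastHeightInBatch >= chainTipHeight) {
            return; // already caught up, nothing more to request
        }
        int from = lastHeightInBatch + 1;
        if (walletChainSynced) {
            sentRequests.add(from);
        } else {
            deferredFromHeight = from; // remember the request for later
        }
    }

    /** Called once when the wallet chain download completes. */
    void onWalletChainSynced() {
        walletChainSynced = true;
        if (deferredFromHeight != null) {
            sentRequests.add(deferredFromHeight);
            deferredFromHeight = null;
        }
    }

    /** Start heights of the requests that were actually sent, in order. */
    List<Integer> sentRequests() {
        return sentRequests;
    }

    public static void main(String[] args) {
        DeferredBlockRequester requester = new DeferredBlockRequester();
        requester.onBatchReceived(725240, 727500); // wallet chain still downloading
        System.out.println("before wallet sync: " + requester.sentRequests());
        requester.onWalletChainSynced();
        System.out.println("after wallet sync:  " + requester.sentRequests());
    }
}
```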

If you allow the BitcoinJ chain to complete its sync and then perform the same test a different error presents itself:

1. With a fully synced wallet, stop Bisq and delete the Bisq/btc_mainnet/db directory.
2. Run the Bisq release from the previous month (at the time of writing, v1.8.2).
3. 6000 BsqBlocks are received from the seednode, advancing the height from 719240 to 725240.
4. The remainder of the BsqBlocks are requested and received, advancing the height to current.
5. Lots of red warnings appear in the log when processing the BsqBlocks.
6. The DAO network status indicates it is out of sync with seednodes and needs to be rebuilt.

The errors seem to indicate that the BlindVoteStore data file received from the seednode is missing data.

We have a blindVoteTx but we do not have the corresponding blindVote payload
We could not find a list which matches the majority so we cannot calculate the vote result. Please restart and resync the DAO state.

I think this may be that the seednode had to truncate the GetDataResponse payload due to too many AccountAgeWitness and/or BlindVotePayload records.

1552038 bytes : BlindVoteStore size first time
1595935 bytes : BlindVoteStore size second and subsequent times.
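To illustrate the suspected mechanism, here is a minimal sketch of a size-capped response, assuming the sender simply stops adding records once a byte limit would be exceeded. The class, the cap, and the record sizes are made up for the example and do not reflect the real seednode implementation; only the two file sizes are taken from the measurement above.

```java
// Hypothetical sketch of a size-capped payload: records are added in order
// until the next one would exceed the limit, and the rest are silently left out.
final class PayloadTruncation {
    static final class Result {
        final int includedRecords;
        final long includedBytes;
        final boolean truncated;

        Result(int includedRecords, long includedBytes, boolean truncated) {
            this.includedRecords = includedRecords;
            this.includedBytes = includedBytes;
            this.truncated = truncated;
        }
    }

    static Result fill(int[] recordSizes, long maxBytes) {
        long total = 0;
        int count = 0;
        for (int size : recordSizes) {
            if (total + size > maxBytes) {
                break; // everything from here on is missing on the receiver
            }
            total += size;
            count++;
        }
        return new Result(count, total, count < recordSizes.length);
    }

    public static void main(String[] args) {
        // Made-up record sizes and cap, purely for illustration.
        Result r = fill(new int[] {400, 400, 400}, 1000);
        System.out.println(r.includedRecords + " records, truncated=" + r.truncated);

        // Difference between the two BlindVoteStore sizes reported above.
        System.out.println(1595935 - 1552038); // prints 43897
    }
}
```

The 43897-byte difference between the first and later downloads is consistent with some records being left out of the first response, though it does not by itself prove that truncation is the cause.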

The same error can be produced without deleting the whole data directory, just AccountAgeWitness and BlindVotePayload.

Mar 12 '22 05:03 jmacxx

And it doesn't matter how often you restart the node to get everything in-sync?

Mar 14 '22 08:03 ripcurlx

> And it doesn't matter how often you restart the node to get everything in-sync?

In the tests I've done so far, it has not gone back into sync after restarting, only when the user explicitly rebuilds the DAO state. I tried waiting 30 blocks in case the snapshot process would somehow fix it, but it did not. The DAO state of v1.8.2, loaded as in the example above, is missing 4 param changes and 675 spent BSQ txs.

About 33% of all network nodes currently have their most recent DAO hashes not in consensus.

Mar 15 '22 02:03 jmacxx

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sep 13 '23 00:09 github-actions[bot]

This issue has been automatically closed because of inactivity. Feel free to reopen it if you think it is still relevant.

Sep 21 '23 00:09 github-actions[bot]
