lnd icon indicating copy to clipboard operation
lnd copied to clipboard

unit test flake: neutrino panic when restarting rescan

Open guggero opened this issue 3 years ago • 4 comments

Found in unit test run:

--- FAIL: TestLightningWallet (7.74s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7d47b7]

goroutine 6 [running]:
testing.tRunner.func1.2({0x9c34e0, 0xfdbfe0})
	/opt/hostedtoolcache/go/1.19.0/x64/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.19.0/x64/src/testing/testing.go:1399 +0x39f
panic({0x9c34e0, 0xfdbfe0})
	/opt/hostedtoolcache/go/1.19.0/x64/src/runtime/panic.go:884 +0x212
github.com/lightninglabs/neutrino.(*Rescan).Update(0x0, {0xc00064a248, 0x1, 0xc00129a4e0?})
	/home/runner/work/go/pkg/mod/github.com/lightninglabs/[email protected]/rescan.go:1391 +0x57
github.com/btcsuite/btcwallet/chain.(*NeutrinoClient).NotifyReceived(0xc0000fa120, {0xc0012bc060?, 0x1, 0x1})
	/home/runner/work/go/pkg/mod/github.com/btcsuite/[email protected]/chain/neutrino.go:457 +0x159
github.com/btcsuite/btcwallet/wallet.(*Wallet).NewAddress(0xc0001ce480, 0x0, {0x0?, 0x9f16e0?})
	/home/runner/work/go/pkg/mod/github.com/btcsuite/[email protected]/wallet/wallet.go:3020 +0x17a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).NewAddress(0xc0005a8180, 0x50?, 0x0, {0xa88a60?, 0x7?})
	/home/runner/work/lnd/lnd/lnwallet/btcwallet/btcwallet.go:556 +0x10c
github.com/lightningnetwork/lnd/lnwallet/test.loadTestCredits(0xc000078720, 0xc000340380, 0x14, 0x0?)
	/home/runner/work/lnd/lnd/lnwallet/test/test_interface.go:275 +0x287
github.com/lightningnetwork/lnd/lnwallet/test.createTestWallet({0xc00031ff80?, 0xa87ad1?}, 0x5?, 0xfe8080, {0xc43560, 0xc0001ca180}, {0xc48820, 0xc0005a8180}, {0xc43b00, 0xc0011c41c8}, ...)
	/home/runner/work/lnd/lnd/lnwallet/test/test_interface.go:378 +0x4a9
github.com/lightningnetwork/lnd/lnwallet/test.runTests(_, _, {_, _}, _, {{0xc00002eba0, 0xe}, {0xa873bc, 0x2}, {0xa879c6, ...}, ...}, ...)
	/home/runner/work/lnd/lnd/lnwallet/test/test_interface.go:3398 +0x2633
github.com/lightningnetwork/lnd/lnwallet/test.TestLightningWallet(0xc000007ba0, {0xa89361, 0x8})
	/home/runner/work/lnd/lnd/lnwallet/test/test_interface.go:3064 +0x8ef
github.com/lightningnetwork/lnd/lnwallet/test/neutrino_test.TestLightningWallet(0x0?)
	/home/runner/work/lnd/lnd/lnwallet/test/neutrino/neutrino_test.go:12 +0x25
testing.tRunner(0xc000007ba0, 0xb9f9b8)
	/opt/hostedtoolcache/go/1.19.0/x64/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/opt/hostedtoolcache/go/1.19.0/x64/src/testing/testing.go:1493 +0x35f
FAIL	github.com/lightningnetwork/lnd/lnwallet/test/neutrino	7.770s
FAIL

Probably happens because the mutex is unlocked here: https://github.com/btcsuite/btcwallet/blob/master/chain/neutrino.go#L351 And afterward the s.rescan is set to nil but s.scanning is still true.

guggero avatar Oct 05 '22 13:10 guggero

Should this be opened as an issue against btcwallet? I don't see a fix in lnd that would address the root cause. It seems that the state management in the chain package for a NeutrinoClient in btcwallet is to blame.

Problem

The NeturinoClient Rescan method locks, sets rescan to nil then unlocks itself before resetting the value. As the client state is scanning = true yet rescan = nil when the lock is released there is a race between the executing Rescan method and any NotifyReceived calls that are waiting to obtain the lock. If a NotifyReceived call picks up the lock in this state, it will panic when it tries to run rescan.Update as scanning is true.

Solution 1

Make a change against btcwallet to address the state management issues:

  1. make the entire Rescan method a critical path (i.e., only unlock via a single defer call)
  2. make the entire NotifyReceived method a critical path
  3. remove the locking in the notificationHandler to prevent a deadlock caused by making Rescan completely critical. This locking isn't necessary as it attempts to read the state of the rescanErr channel. But if rescanErr is nil, then there is no Rescan object created yet so no errors to read. Also, reading from a nil channel will simply block but the read is wrapped in a select statement with other exits.

Risks

  • there are no unit tests on the neutrino client in btcwallet and even a targeted change like this is hard to know if it introduced any regressions

MStreet3 avatar Oct 13 '22 21:10 MStreet3

Yes, we should probably open the same issue in Neutrino. I just thought it would get more visibility when opening it in lnd (and also because the actual test that fails is in lnd, I don't think this is covered by unit tests in the Neutrino repo).

guggero avatar Oct 14 '22 09:10 guggero

~~made changes to btcwallet here: https://github.com/btcsuite/btcwallet/pull/820~~ see https://github.com/btcsuite/btcwallet/pull/822

MStreet3 avatar Oct 15 '22 22:10 MStreet3

the btcwallet change is now here: https://github.com/btcsuite/btcwallet/pull/822

and the lnd temporary change that runs itests with the above branch from mstreet3 repo is here: https://github.com/lightningnetwork/lnd/pull/7049

MStreet3 avatar Oct 27 '22 16:10 MStreet3