neo-go icon indicating copy to clipboard operation
neo-go copied to clipboard

NPE in WebSocket client subscriptions

Open cthulhu-rider opened this issue 1 year ago • 1 comments

caught suddenly while debugging https://github.com/nspcc-dev/neofs-node/pull/2444, don’t have a special context. I subscribe to new blocks and notary requests

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x7bc119]

goroutine 14 [running]:
github.com/nspcc-dev/neo-go/pkg/rpcclient.(*WSClient).notifySubscribers(0xc000224480, {0x16?, {0xa11180?, 0xc000283ea0?}})
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:599 +0x299
github.com/nspcc-dev/neo-go/pkg/rpcclient.(*WSClient).wsReader(0xc000224480)
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:513 +0x545
created by github.com/nspcc-dev/neo-go/pkg/rpcclient.NewWS
	/home/ll/go/pkg/mod/github.com/nspcc-dev/[email protected]/pkg/rpcclient/wsclient.go:422 +0x638

may be i violated some strict requirement on my side, but didn't find it after a quick look

cthulhu-rider avatar Aug 17 '23 09:08 cthulhu-rider

The issue is caused by the fact that subscriptions map of WS client isn't in sync with receivers map (although it always must be). And currently I can't find where the bug is and also can't reproduce the panic. It's possible that concurrency issue occurs between subsequent event receiving and unsubscription.

AnnaShaleva avatar Oct 19 '23 16:10 AnnaShaleva

100ms long blocks make it fail often.

carpawell avatar Jul 16 '24 16:07 carpawell

@roman-khimov, @AnnaShaleva, @AliceInHunterland, there is some race-by-design currently and I am not sure what can be done here (mostly because I want to make a fix small and reliable, no big changes). The key reason is that the subscription lock is always taken after RPC is done, (e.g. https://github.com/nspcc-dev/neo-go/blob/f0ae14e4db9c26da2ee5686e51edf4c248647ce3/pkg/rpcclient/wsclient.go#L774-L779) and it cannot control the sequence of the client's call.

This became an often issue when we started to wait for some TXs in our test runs: https://github.com/nspcc-dev/neofs-node/blob/24fc815cd4b262b625e24bac62259325aba3444c/pkg/morph/client/meta.go#L32 cause it uses subscriptions per every call: https://github.com/nspcc-dev/neo-go/blob/f0ae14e4db9c26da2ee5686e51edf4c248647ce3/pkg/rpcclient/waiter/waiter.go#L247-L260 (per every object put which can be often and can be concurrent).

Every client's (Un)Subscription call does two things: an RPC call and a subscription map lock (two of maps currently). If we imagine that there is one routine that tries to subscribe (A) and one routine that tries to unsubscribe (B), the following sequence can happen:

  1. Current number of subscriptions is X
  2. B does an RPC and makes number of subscriptions X-1
  3. A does an RPC and makes number of subscriptions X again
  4. A holds subscription locks and rewrites client's subscription state (subscription with ID X now points to a different channel; channel that was registered by B is lost and is not related to any real subscription but is still included in the receivers map)
  5. B holds subscription locks and drops subscription X (first, it is an error and we have just lost a subscription that we think was made successfully second, we have lost a channel in the receivers map, and no corresponding subscription points to it)
  6. X subscription is received by the WS client (in practice it is a new block, 100ms, quite often to be sure this issue happens every hour), we range through the receivers, see no corresponding subscription, and panic: https://github.com/nspcc-dev/neo-go/blob/f0ae14e4db9c26da2ee5686e51edf4c248647ce3/pkg/rpcclient/wsclient.go#L696

1

The easiest solution is to put subscription RPC inside Lock/Unlock space so no conflicts will ever happen. but I am not sure this is what is expected to be done and also performance may decrease (but it seems a real solution that fits the current design with non-monotonic subscription IDs).

2

The second idea is to make the subscription ID be smth like the request ID (the later subscription call is made the bigger its ID is always). I am not sure it may happen, cause there is some logic about the subscription numbers: https://github.com/nspcc-dev/neo-go/blob/f0ae14e4db9c26da2ee5686e51edf4c248647ce3/pkg/services/rpcsrv/server.go#L2776-L2799

3

Also, I think it is possible to make every subscription completely unique and not be a simple number. This may be some hash or time-dependent thing, or requests-dependent, is not important, it just has to be unique for every subscription and has to make conflicts impossible.


Thoughts?

carpawell avatar Jul 24 '24 15:07 carpawell

Server can return any IDs it wants and client has to live with that, so any server-side "fix" is not appropriate.

Simple lock extension can affect event processing since it has to access the same map as well. What we need in fact is to have the same order of events as we have on the server. And it's not a simple thing since two concurrent requests can be served in any order by the server. Still, we do have some order in wsReader, we just need to leverage it.

roman-khimov avatar Jul 24 '24 15:07 roman-khimov

The other idea is to fetch the subscriber before unsubscribe and compare current one with it after the request, if it's the old one --- drop it, if it's something else --- OK, we're not touching it. Subscription always overwrites values anyway and that's correct.

roman-khimov avatar Jul 24 '24 19:07 roman-khimov