
Block RPC calls if not properly started or errored out during start

Open • dstadulis opened this issue 1 year ago • 21 comments

In a recent user report, litd was shown to malfunction because a universe RPC call ran before the universe sub-service was online.

See TODO summary here: https://github.com/lightninglabs/taproot-assets/issues/1122#issuecomment-2345928938

dstadulis avatar Aug 29 '24 16:08 dstadulis

Thanks for the issue report! Should we allow any RPC requests to tapd before the universe sub-service has been started?

If they should be separated, i.e. if we do want to allow some requests before the universe sub-service is up, would an OK first version be to simply disallow all tapd RPCs until the universe sub-service is up, and then introduce the separation at a later stage?

ViktorT-11 avatar Aug 29 '24 17:08 ViktorT-11

I don't think they should be separated, or that there would be any benefit to that.

IIUC the desired fix is to disallow all tapd RPCs until all tapd sub-services are up.

jharveyb avatar Aug 29 '24 18:08 jharveyb

LiT today should already take care of this. We only mark a subserver as "ready for calls" once that subserver's Start method (for tap, this is the (s *Server) StartAsSubserver method) returns without an error. We want to keep LiT as generic as possible in terms of handling subservers, so each subserver is expected to be ready to handle calls once that start method has returned; we really should not be doing extra calls to check whether various subserver specifics are ready. The SLA here is that things should be ready once that method has returned.
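
A minimal sketch of that contract (hypothetical names like subServer and readyTracker, not LiT's actual types): the parent only flips the subserver to "ready for calls" after Start returns without an error.

```go
package readiness

import (
	"errors"
	"sync/atomic"
)

// subServer is a stand-in for the interface a LiT-style parent manages.
type subServer interface {
	Start() error
}

// readyTracker gates RPC traffic for a single subserver.
type readyTracker struct {
	ready atomic.Bool
}

// startSubServer starts the subserver and only marks it as ready for RPC
// calls once Start has returned without an error.
func (r *readyTracker) startSubServer(s subServer) error {
	if err := s.Start(); err != nil {
		// A failed start must never be marked as ready.
		return err
	}
	r.ready.Store(true)
	return nil
}

// checkCall is what the request router would consult before forwarding an
// RPC to the subserver.
func (r *readyTracker) checkCall() error {
	if !r.ready.Load() {
		return errors.New("subserver not yet ready for calls")
	}
	return nil
}
```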

ellemouton avatar Aug 30 '24 05:08 ellemouton

If y'all don't want to permanently change the behaviour, then I suggest adding a "BlockTillUniverseStart" functional option or something similar on that call (rough sketch below).
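
Something like this, as a rough sketch: only the option name comes from the suggestion above; the Server fields, universeReady channel and the StartAsSubserver body are made up for illustration.

```go
package tapsketch

// Server is a stand-in for tapd's server type.
type Server struct {
	// universeReady is assumed to be closed once the universe
	// sub-service has started.
	universeReady chan struct{}
}

func (s *Server) start() error { return nil }

type startOpts struct {
	blockTillUniverseStart bool
}

// StartOption configures how StartAsSubserver behaves.
type StartOption func(*startOpts)

// BlockTillUniverseStart makes StartAsSubserver wait for the universe
// sub-service before returning.
func BlockTillUniverseStart() StartOption {
	return func(o *startOpts) {
		o.blockTillUniverseStart = true
	}
}

// StartAsSubserver illustrates how the option would be consumed; it is not
// tapd's real method.
func (s *Server) StartAsSubserver(opts ...StartOption) error {
	cfg := &startOpts{}
	for _, opt := range opts {
		opt(cfg)
	}

	if err := s.start(); err != nil {
		return err
	}

	if cfg.blockTillUniverseStart {
		// Only return once the universe sub-service signals that it
		// is up, so the caller can safely forward universe RPCs.
		<-s.universeReady
	}

	return nil
}
```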

ellemouton avatar Aug 30 '24 05:08 ellemouton

Hmm, I see. That is the correct behavior as far as I can tell. We only have a screenshot of a stack trace to go on, and it looked like an RPC request did come through before the internal gRPC server in tapd was started.

But perhaps tapd stopped with an error and the request came through after the gRPC server was stopped? Though from just the code it looks like that's handled correctly as well.

Need more info from the user then, preferably actual logs.

guggero avatar Aug 30 '24 06:08 guggero

Reposting the stack trace as text (several lines are cut off in the screenshot):

```
2024-08-29 11:27:04.591 [DBG] RPCS: [/lnrpc.State/GetState] requested
2024-08-29 11:27:04.682 [DBG] RPCS: [/lnrpc.State/GetState] requested
2024-08-29 11:27:04.690 [DBG] RPCS: [/universerpc.Universe/ListFederationServers] requested
2024-08-29 11:27:04.692 [DBG] RPCS: [/lnrpc.Lightning/CheckMacaroonPermissions] requested
2024-08-29 11:27:04.692 [DBG] RPCS: [/lnrpc.Lightning/ListPeers] requested

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x16c62a0]

goroutine 3224 [running]:
github.com/lightninglabs/taproot-assets.(*rpcServer).ListFederationServers(0x0?, {0x32b0ad0?, 0x40026efc50?}, 0x1bb1040?)
    github.com/lightninglabs/[email protected]/rpcserver.go:5654 +0x20
github.com/lightninglabs/taproot-assets/taprpc/universerpc._Universe_ListFederationServers_Handler.func1({0x32b0ad0?, 0x40026efc50?}, {0x1bb1040?,
    github.com/lightninglabs/[email protected]/taprpc/universerpc/universe_grpc.pb.go:688 +0xd0
github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).CreateServerOpts.(*InterceptorChain).middlewareUnaryServerInterceptor.func7({0x32b0ad0,
    github.com/lightningnetwork/[email protected]/rpcperms/interceptor.go:832 +0xec
google.golang.org/grpc.getChainUnaryHandler.func1({0x32b0ad0, 0x40026efc50}, {0x1bb1040, 0x40026efc80})
    google.golang.org/[email protected]/server.go:1163 +0xa0
github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).CreateServerOpts.(*InterceptorChain).MacaroonUnaryServerInterceptor.func5({0x32b0ad0,
    github.com/lightningnetwork/[email protected]/rpcperms/interceptor.go:689 +0x90
google.golang.org/grpc.getChainUnaryHandler.func1({0x32b0ad0, 0x40026efc50}, {0x1bb1040, 0x40026efc80})
    google.golang.org/[email protected]/server.go:1163 +0xa0
github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).CreateServerOpts.(*InterceptorChain).rpcStateUnaryServerInterceptor.func3({0x32b0ad0,
    github.com/lightningnetwork/[email protected]/rpcperms/interceptor.go:781 +0x108
google.golang.org/grpc.getChainUnaryHandler.func1({0x32b0ad0, 0x40026efc50}, {0x1bb1040, 0x40026efc80})
    google.golang.org/[email protected]/server.go:1163 +0xa0
github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).CreateServerOpts.errorLogUnaryServerInterceptor.func1({0x32b0ad0?, 0x40026efc50?}, {0x
    github.com/lightningnetwork/[email protected]/rpcperms/interceptor.go:605 +0x48
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({0x32b0ad0, 0x40026efc50}, {0x1bb1040, 0x40026efc80},
    google.golang.org/[email protected]/server.go:1154 +0x88
github.com/lightninglabs/taproot-assets/taprpc/universerpc._Universe_ListFederationServers_Handler({0x1e5e5a0, 0x400056e600}, {0x32b0ad0, 0x40026e
    github.com/lightninglabs/[email protected]/taprpc/universerpc/universe_grpc.pb.go:690 +0x148
google.golang.org/grpc.(*Server).processUnaryRPC(0x40008541e0, {0x32b0ad0, 0x40026ef99}, {0x32c2920, 0x4000588340}, 0x40026b19e0, 0x4000591c20,
    google.golang.org/[email protected]/server.go:1343 +0xb40
google.golang.org/grpc.(*Server).handleStream(0x40008541e0, {0x32c2920, 0x4000588340}, 0x40026b19e0)
    google.golang.org/[email protected]/server.go:1737 +0x95c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
    google.golang.org/[email protected]/server.go:986 +0x88
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 3223
    google.golang.org/[email protected]/server.go:997 +0x14c
```

jharveyb avatar Aug 30 '24 16:08 jharveyb

So today we have a middleware interceptor that'll bounce all calls until the server is fully started: https://github.com/lightninglabs/taproot-assets/blob/72b93f84a0afa08e01c99d582357672133fc9b20/rpcperms/interceptor.go#L380-L408

We then set it to active after all the sub-systems have started here: https://github.com/lightninglabs/taproot-assets/blob/72b93f84a0afa08e01c99d582357672133fc9b20/server.go#L374-L378
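
For illustration, such a gating interceptor has roughly this shape; this is a simplified sketch rather than the code behind the links above, and the type name and error text are assumptions.

```go
package rpcgate

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// InterceptorChain holds the "is the server fully started" flag.
type InterceptorChain struct {
	rpcActive atomic.Bool
}

// SetRPCActive is called once all sub-systems have started.
func (c *InterceptorChain) SetRPCActive() {
	c.rpcActive.Store(true)
}

// UnaryGate bounces every unary call until the server is marked active.
func (c *InterceptorChain) UnaryGate() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{},
		_ *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {

		if !c.rpcActive.Load() {
			return nil, status.Error(
				codes.Unavailable,
				"the RPC server is still starting up",
			)
		}

		return handler(ctx, req)
	}
}
```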

From the trace, either the rpcserver or FederationDB was nil at that point, which is puzzling (all the sub-system structs have already been initialized at that point). So perhaps some mutation occurred somewhere?

Roasbeef avatar Aug 30 '24 17:08 Roasbeef

This issue is coming from our team. When the terminal is restarted while lnd and tapd are still in the process of syncing, the terminal crashes continuously due to the various tapd requests coming in (such as listing assets, subscribeRfqEvents, universe calls).

lukegao209 avatar Aug 31 '24 10:08 lukegao209

@lukegao209 could you send us a full log (including the stack trace) from the beginning of a startup sequence until it panics?

guggero avatar Sep 03 '24 08:09 guggero

@lukegao209 also, what port are you issuing the ListFederationServers call on? The main litd port (8443) or the integrated tapd port (10029)?

guggero avatar Sep 09 '24 11:09 guggero

@lukegao209 also, what port are you issuing the ListFederationServers call on? The main litd port (8443) or the integrated tapd port (10029)?

port 10029

lukegao209 avatar Sep 09 '24 13:09 lukegao209

@lukegao209 - ok cool, thanks 🙏 This makes sense then - rather point your requests to LiT's port 8443, as LiT will then block calls to tapd until it is ready.
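
For example, a client that currently dials the integrated tapd port would switch the dial target from 10029 to 8443; a minimal sketch, assuming a default cert path and leaving macaroon handling out.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// LiT's TLS certificate; the path is an assumption for a default
	// setup and may differ on your machine.
	creds, err := credentials.NewClientTLSFromFile(
		"/home/user/.lit/tls.cert", "",
	)
	if err != nil {
		log.Fatal(err)
	}

	// Dial port 8443 (LiT) rather than 10029 (integrated tapd), so LiT
	// can hold back tapd calls until tapd is actually ready.
	conn, err := grpc.Dial(
		"localhost:8443", grpc.WithTransportCredentials(creds),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// From here, wrap conn in the generated universerpc client and
	// attach the macaroon as gRPC metadata before calling
	// ListFederationServers.
}
```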

@Roasbeef re

So today we have a middleware interceptor that'll bounce all calls until the server is fully started:

This is actually not the case for the subserver startup of Tap.

  • this is done for the main startup of Tap when it runs in its own process: see RunUntilShutdown, which sets up the interceptor chain for Tap that does this blocking
  • but for StartAsSubserver, which LiT uses, the interceptor chain in Tap is not set up (see the sketch after this list)
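
A simplified sketch of the two entry points; the method names match the bullets above, but the bodies and helper types are illustrative only, not tapd's actual implementation.

```go
package tapentry

import "google.golang.org/grpc"

// Server and interceptorChain are stand-ins for tapd's real types.
type Server struct{}

func (s *Server) startSubSystems() error { return nil }

type interceptorChain struct{ active bool }

func newInterceptorChain() *interceptorChain { return &interceptorChain{} }

// CreateServerOpts would return the gating interceptors as gRPC options.
func (c *interceptorChain) CreateServerOpts() []grpc.ServerOption { return nil }

func (c *interceptorChain) SetRPCActive() { c.active = true }

// RunUntilShutdown is the standalone entry point: the gRPC server is built
// with the interceptor chain, so calls are bounced until SetRPCActive.
func (s *Server) RunUntilShutdown() error {
	chain := newInterceptorChain()

	// Build the gRPC server with the chain's gating interceptors
	// installed; service registration and Serve are omitted here.
	_ = grpc.NewServer(chain.CreateServerOpts()...)

	if err := s.startSubSystems(); err != nil {
		return err
	}
	chain.SetRPCActive()

	return nil // block until shutdown in the real implementation
}

// StartAsSubserver is the integrated (LiT) entry point: no interceptor
// chain is set up, so nothing gates incoming calls while the sub-systems
// are still starting.
func (s *Server) StartAsSubserver() error {
	return s.startSubSystems()
}
```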

ellemouton avatar Sep 09 '24 13:09 ellemouton

So, in conclusion: the terminal should send its lnd requests to port 10009, and requests for all other services (tapd, loop, pool, etc.) should go to port 8443?

lukegao209 avatar Sep 09 '24 13:09 lukegao209

you can also point LND requests to 8443 👍

ellemouton avatar Sep 09 '24 13:09 ellemouton

thanks

lukegao209 avatar Sep 09 '24 13:09 lukegao209

btw, will terminal's accounts support taproot asset channels?

lukegao209 avatar Sep 09 '24 13:09 lukegao209

btw, will terminal's accounts support taproot asset channels?

not out of the box, no. we'll need to update the litd account system once taproot asset channels are fully implemented.

guggero avatar Sep 09 '24 14:09 guggero

If Taproot Assets are implemented through parsing the invoice’s custom_data, maybe I can start working on supporting it now.

lukegao209 avatar Sep 09 '24 15:09 lukegao209

@ellemouton

but for StartAsSubserver, which LiT uses, the interceptor chain in Tap is not set up

Gotcha, ok that seems to be the core issue here. We should retain the interceptor chain for tapd, either by using the same system or via a more explicit check.

Roasbeef avatar Sep 09 '24 23:09 Roasbeef

@guggero - given our latest offline discussion, do you reckon we can close this and move it to the Tap repo?

ellemouton avatar Sep 12 '24 10:09 ellemouton

Yes. TODOs are (roughly sketched below):

  • tapd: return false in waitForReady if tapd was shut down or errored out
  • tapd: don't set initialized=true when startup (StartAsSubserver) fails
  • tapd: add a server nil check to the global error interceptors
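
Roughly what the first two TODOs could look like, plus the nil guard from the third; the identifiers here (initialized, ready, quit, checkServer) are assumptions for illustration, not tapd's actual ones.

```go
package tapready

import (
	"errors"
	"sync/atomic"
)

// Server is a stand-in for tapd's server type.
type Server struct {
	initialized atomic.Bool
	ready       chan struct{} // closed once startup completed successfully
	quit        chan struct{} // closed on shutdown or failed startup
}

// StartAsSubserver only marks the server as initialized if startup
// actually succeeded (second TODO).
func (s *Server) StartAsSubserver() error {
	if err := s.startSubSystems(); err != nil {
		close(s.quit)
		return err
	}

	s.initialized.Store(true)
	close(s.ready)

	return nil
}

// waitForReady blocks until startup completes, but returns false if the
// server was shut down or errored out during start (first TODO).
func (s *Server) waitForReady() bool {
	select {
	case <-s.ready:
		return true
	case <-s.quit:
		return false
	}
}

// checkServer is the kind of nil guard the third TODO asks for in the
// global error interceptors.
func checkServer(s *Server) error {
	if s == nil || !s.initialized.Load() {
		return errors.New("tapd RPC server not ready")
	}
	return nil
}

func (s *Server) startSubSystems() error { return nil }
```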

Transferring to tapd repo.

guggero avatar Sep 12 '24 10:09 guggero