go-nitro icon indicating copy to clipboard operation
go-nitro copied to clipboard

Engine crashes if virtual channel is closed too quickly

Open lalexgap opened this issue 3 years ago • 7 comments

We can run into this error when running our testground test:

cannot transfer ownership of channel to from objective VirtualFund-0x86e2d9f554f8529230e8431b9757a1a0ce10417481d993c1741c3a82c1cbacda to VirtualDefund-0x86e2d9f554f8529230e8431b9757a1a0ce10417481d993c1741c3a82c1cbacda

This is due to one client still having an active VirtualFundObjective when they receive a VirtualDefundObjective. When the client attempts to store the VirtualDefundObjective in the store, the objective tries to take ownership of the virtual channel. However since the VirtualFundObjective is active it still has ownership on the channel. This results in an error occurring.

This can easily occur if one client receives messages in a different order than in which they were sent.

lalexgap avatar Jun 08 '22 16:06 lalexgap

Since only one objective can own a ledger at a time things blow up.

Do you mean, only one objective can own a virtual channel at a time? As discussed at standup today, ledger channels are expected to "feature" in multiple objectives (for different virtual channels) concurrently.

This is roughly what I think is happening:

I suppose this scenario is less likely to happen when there is a delay between virtual funding and defunding, as we would expect when there are some payments being sent. Of course we still need to handle it.

A couple of options we discussed at standup: 1.. Enqueue the VirtualDefund objective proposal. When VirtualFund is finished, dequeue the VirtualFund proposal and proceed as normal. 2. Reject the VirtualDefund objective. Alice is notified and will try again later. 3. Reject the VirtualFund objective. Since there is an isFinal state in the wild, there is no point continuing with VirtualFund. VirtualDefund trumps VirtualFund (as long as it passes verification).

It occurs to me now that the more realistic scenario is that a payment arrives before VirtualFund has completed. We wouldn't want to reject that (as in option 2).

geoknee avatar Jun 09 '22 16:06 geoknee

Since only one objective can own a ledger at a time things blow up.

Do you mean, only one objective can own a virtual channel at a time? As discussed at standup today, ledger channels are expected to "feature" in multiple objectives (for different virtual channels) concurrently.

That's correct! I confused myself with ownership of virtual channels and ledger channels. The issue is that only one objective can own the virtual channel

lalexgap avatar Jun 09 '22 17:06 lalexgap

I suppose this scenario is less likely to happen when there is a delay between virtual funding and defunding, as we would expect when there are some payments being sent. Of course we still need to handle it.

I've introduced a minimum delay between creating the virtual channel and closing it in the testground test. (I think I had tried this before but I didn't properly implement the delay). With this change it looks like the integration test avoids running into this issue.

lalexgap avatar Jun 09 '22 17:06 lalexgap

What's the correct long-term behaviour for this? How about we let the defund objective forcefully take over from the fund objective? Is there any point in the client in question proceeding to try and fund the channel when they know at least one peer is trying to defund it?

geoknee avatar Sep 21 '22 09:09 geoknee

Is there any point in the client in question proceeding to try and fund the channel when they know at least one peer is trying to defund it?

I don't think this scenario would occur. A peer can only start virtual defunding once the virtual funding objective is complete. So the peer would only start virtual defunding once they have received signed post fund setups from everyone to complete virtual funding.

What can happen now is that we happen to receive the virtual defund message before we receive their post fund setup from a peer. Since the virtual funding objective still has ownership on the virtual channel our store errors and the engine panics.

We only make between 1 and 5 payments before attempting to close the virtual channel so virtual channels are pretty "short-lived", which is why we run into this.

For now we avoid this by sleeping before closing a virtual channel. This ensures that our peers have a had a chance to receive our post fund setups and finish their virtual funding objectives.

What's the correct long-term behaviour for this?

In the short term it's probably good enough to catch this scenario and report an error. Right now the whole engine panics when this happens.

Longer term I think we'd want something like Option 1 we discussed here. We should queue up the virtual defund objective and start cranking after the virtual fund objective is complete.

lalexgap avatar Sep 21 '22 20:09 lalexgap

Yeah I think sleeping is not the long term solution. I still think that abandoning the virtual fund objective might be the right thing to do in this case, if we think about the semantic meaning of each of those messages coming in. Shall we discuss at standup today?

geoknee avatar Sep 22 '22 09:09 geoknee

Discussed in standup today: abandoning the virtualfund objective is not a good approach, since validating a proposal to virtually defund requires inspecting valid vouchers received, which depends on having the postfund signatures (i.e. completing virtualfund). Therefore the best approach is to queue up the virtualdefund objective and make sure it starts executing once the virtualfund objective completes.

geoknee avatar Sep 22 '22 16:09 geoknee