Node-local reveal_timeout also assumed for partner
The reveal timeout is a node-local configuration parameter that is primarily used to check the sanity of incoming transfers. But this parameter is never communicated to our partner, the potential payer of a non-sane payment.
As a node, on initialisation of a `NettingChannel` from a blockchain event, we assume our `reveal_timeout` to be the same as the `reveal_timeout` that the partner is using:
https://github.com/raiden-network/raiden/blob/63ddd6c6e53e140d8ecea0acc4d43760e99887a0/raiden/blockchain/decode.py#L323-L329
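For illustration, a heavily simplified sketch of what that initialisation does. The class and field names below are only stand-ins for the real `NettingChannelState`, not the actual decode.py code:

```python
from dataclasses import dataclass


@dataclass
class NettingChannelStateSketch:
    """Illustrative stand-in for NettingChannelState, not the real class."""
    our_address: str
    partner_address: str
    settle_timeout: int   # taken from the on-chain event
    reveal_timeout: int   # taken from our node-local config


def channel_from_open_event(event: dict, config_reveal_timeout: int) -> NettingChannelStateSketch:
    # The ChannelOpened event only carries the on-chain settle_timeout.
    # The reveal_timeout is filled in from OUR configuration; the partner's
    # (possibly different) value is never learned by this node.
    return NettingChannelStateSketch(
        our_address=event["participant1"],
        partner_address=event["participant2"],
        settle_timeout=event["settle_timeout"],
        reveal_timeout=config_reveal_timeout,
    )
```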
This can lead to wrong reasoning about the code, e.g. this https://github.com/raiden-network/raiden/issues/4998#issuecomment-540590025 assumes that the payer will check payment validity from the payee's perspective before initiating the payment:
https://github.com/raiden-network/raiden/blob/f544ceb2a86a98872cb034fe6d626f454293669a/raiden/transfer/channel.py#L250-L253
where a sane payment is, among other properties, given when `reveal_timeout < lock_timeout <= settle_timeout`.
But here, the `reveal_timeout` should actually be our partner's `reveal_timeout`, since she is receiving the payment.
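To make the mismatch concrete, here is a minimal sketch of that sanity check (a simplification of the linked channel.py lines, not a copy of them):

```python
def is_sane_lock_timeout(lock_timeout: int, reveal_timeout: int, settle_timeout: int) -> bool:
    # A payment is considered sane when the lock leaves enough room to reveal
    # the secret and still fits into the settlement window:
    #   reveal_timeout < lock_timeout <= settle_timeout
    # The reveal_timeout passed in here is our node-local value; for an
    # outgoing payment the relevant value would be the partner's, which we
    # never learn.
    return reveal_timeout < lock_timeout <= settle_timeout
```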
Since most nodes don't use custom reveal_timeouts, this parameter is effectively synchronised between channel participants. But when, for example, the partner uses a larger timeout than we do, we assume the payment to be sane while the partner will reject it, resulting in a channel where payments can frequently fail.
~~This effect is somewhat buffered "globally" by the PFS, since it is the only other party that knows a node's `reveal_timeout` and uses this in path-calculation.~~
Not sure whether this is that critical, but please participate in an open discussion.
Possible fix:
- We could add p2p-communication of the `reveal_timeout` between partners.
Relevant issues:
- https://github.com/raiden-network/raiden/issues/6704
- https://github.com/raiden-network/raiden/issues/6645
- https://github.com/raiden-network/raiden/issues/864
We are strictly following the route returned by the PFS, which does know about the reveal_timeouts. So the problem case should never occur when using a PFS.
The only problem I can see right now is that a direct channel could seem usable while it is not, due to an increased reveal_timeout on the partner's node. That case is a problem, but since this is a rare case with a somewhat clear failure mode, I would rate it as low-priority.
The PFS knows about it, but it does not enforce relative `reveal_timeout` synchronisation between the nodes. Also, the lock expiration is not chosen by the PFS (please correct me if I'm wrong).
The PFS only enforces that the gap between `settle_timeout` and the (smaller) `reveal_timeout` is at least one `reveal_timeout` (or, stated differently, it enforces `settle_timeout / reveal_timeout >= 2`).
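As a sketch of that constraint (my reading of it, with illustrative numbers):

```python
def pfs_accepts(settle_timeout: int, reveal_timeout: int) -> bool:
    # The gap between settle_timeout and reveal_timeout must be at least one
    # reveal_timeout, which is the same as settle_timeout / reveal_timeout >= 2.
    return settle_timeout - reveal_timeout >= reveal_timeout


# e.g. with settle_timeout = 500, any reveal_timeout up to 250 passes;
# nothing relates the two partners' reveal_timeouts to each other.
assert pfs_accepts(500, 250)
assert not pfs_accepts(500, 300)
```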
Imagine both nodes operate with low reveal timeouts, or a high settle timeout. Now the payer has a slightly smaller reveal timeout than the payee. The payer is an LC that wants to operate with low margins between reveal_timeout and lock expiration (see https://github.com/raiden-network/team/issues/891).
```
---rt_payer--------lock_expiration-------rt_payee----------...--------settle_timeout
```
So as soon as the lock expiration falls between the payer's reveal_timeout and the payee's reveal_timeout, we run into problems. This is the aforementioned problematic scenario, since the payer doesn't learn why the payments fail. But this could happen even with PFS-provided paths.
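A numeric example of that scenario (the values are made up, only meant to show that every check visible to the payer and the PFS passes while the payee would still reject the transfer):

```python
# Illustrative numbers only:
settle_timeout = 500
rt_payer = 20      # low-margin LC configuration
rt_payee = 80      # partner expects more congestion
lock_timeout = 50  # chosen by the payer, between the two reveal timeouts

# Payer-side sanity check passes, because it only knows its OWN reveal_timeout:
assert rt_payer < lock_timeout <= settle_timeout

# Payee-side check fails, so the transfer is rejected and silently expires:
assert not (rt_payee < lock_timeout <= settle_timeout)

# Both nodes individually satisfy the PFS constraint settle/reveal >= 2,
# so the PFS would not have filtered this pairing either:
assert settle_timeout / rt_payer >= 2
assert settle_timeout / rt_payee >= 2
```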
I could be wrong with my assumptions, and I still don't know how likely this scenario is.
I just think we should be careful with this, since the LC is planning to operate with low margins, and the reveal_timeout is assumed to be dynamic under changing congestion-load.
Is the described case actually problematic, or is it just the check that is overly strict? Maybe we should just check that `lock_timeout > REVEAL_TIMEOUT_MIN` to discard obviously failing payments? Since we can't guarantee any latency maximums, we can't avoid all cases of payment failure, especially if we allow a dynamic reveal timeout. I would like to avoid communicating ever-changing reveal timeouts between nodes.
If by `REVEAL_TIMEOUT_MIN` you mean this https://github.com/raiden-network/raiden/blob/f544ceb2a86a98872cb034fe6d626f454293669a/raiden/constants.py#L146, then I think this would be equivalent to making the reveal_timeout a chain-dependent constant as determined by the node implementation, since the user-defined `reveal_timeout` is not used anywhere else?
> I think this would be equivalent to making the reveal_timeout a chain-dependent constant as determined by the node implementation, since the user-defined `reveal_timeout` is not used anywhere else?
It should still be used when checking whether to accept a transfer or not. I only suggest relaxing the check before sending a transfer.
Regarding loosening the transfer initiation checks: in my opinion the problem is that the initiation check is actually too loose (but we are missing the information to make that check more strict), and a stricter check would depend on the reveal_timeouts of all nodes along a possible path.
The goal is for the initiator to choose a lock_timeout that complies with:

`lock_timeout > max(node.reveal_timeout for node in path)`

so that it is guaranteed that the check for accepting an incoming mediated transfer passes for all nodes across the path.
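Expressed as a predicate (just a restatement of the condition above; today the initiator cannot evaluate it, since it does not know the other nodes' reveal_timeouts):

```python
from typing import Sequence


def lock_timeout_is_viable(lock_timeout: int,
                           path_reveal_timeouts: Sequence[int],
                           settle_timeout: int) -> bool:
    # Every node on the path must see lock_timeout > its own reveal_timeout,
    # and the usual settlement bound must still hold.
    return max(path_reveal_timeouts) < lock_timeout <= settle_timeout
```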
Current state:
- The initiator chooses any lock_timeout and hopes that the checks go through across all nodes.
- The initiator will not be informed whether the chosen lock_timeout was sane, because the transfer just blocks somewhere along the path and expires.
Solution 1:
- The initiator chooses a lock_timeout close to the settle_timeout, making the silent failure less likely.
Solution 2:
- The initiator "guesses" the lock_timeout, and the PFS filters out the paths that don't comply with the aforementioned check along the path.
- If no path is returned, the node increments the lock_timeout and tries again.
Solution 3:
- The PFS does not filter, but additionally returns the `max_reveal_timeout` along all suggested paths (see https://github.com/raiden-network/raiden/issues/6704#issuecomment-746473825), so that the initiator can make an informed decision about correcting the lock_timeout (a rough sketch follows at the end of this comment).
Solution 4:
- All nodes synchronise their reveal_timeouts (e.g. broadcast, or a polling endpoint at the PFS).
In general, this problem is only relevant as soon as we see highly individual reveal-timeouts across the network.
If it comes to this, I would vote for Solution 3.
Solution 4 sounds very unreasonable, but it would have the advantage that nodes would know exactly when channels with partners are incompatible with each other; such channels could then be ignored in general to free up locked funds. Incompatibility could be due to highly incompatible node defaults (e.g. one partner always wants very low lock_timeouts, while the other assumes slow congestion and thus very high reveal_timeouts). Since this sounds like an edge case, it could also be solved with Solution 3 and additional heuristics, so that channels with partners are closed, e.g. when they are not part of the last X valid paths.
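For completeness, a rough sketch of how Solution 3 could look on the initiator side. The `max_reveal_timeout` field in the PFS response and the helper below are hypothetical, not existing API:

```python
from typing import List, Optional


def choose_route_and_lock_timeout(pfs_routes: List[dict],
                                  settle_timeout: int,
                                  preferred_lock_timeout: int) -> Optional[dict]:
    # Each route is assumed to carry a hypothetical "max_reveal_timeout" over
    # the nodes on that path. The initiator bumps its lock_timeout up to the
    # minimum viable value instead of failing silently along the path.
    for route in pfs_routes:
        min_viable = route["max_reveal_timeout"] + 1
        lock_timeout = max(preferred_lock_timeout, min_viable)
        if lock_timeout <= settle_timeout:
            return {"route": route, "lock_timeout": lock_timeout}
    return None  # no suggested path fits into the settlement window
```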