bolts TLV failure message and recommended length to 1024

Extend the failure message specification with a tlv stream in line with the onion tlv payload and p2p messages.

Sep 02 '22 09:09 joostjager

Concept ACK. Note that this doesn't provide a lot of bytes for the tlv stream though, since we're limited to 256 bytes for the failure payload, but it's better than nothing.

An important point to note is that the tlv stream mustn't be length-prefixed (same as our tlv extensions at the end of each lightning message). It's worth adding a test vector to make sure people get that right.

Sep 02 '22 14:09 t-bast

Yes, I can add a test vector. But I think I'll do it just for the failure message bytes excluding padding and encryption.

Sep 12 '22 14:09 joostjager

Added test vector to ensure that implementers won't prepend the tlv stream with a length.

Sep 12 '22 18:09 joostjager

IMHO it would really be more helpful to future readers to update the existing e2e test vector... I can take a stab at it when implementing this change for eclair.

Sep 12 '22 19:09 t-bast

Discussed in spec meeting: pair this change with an increase in the recommended failure message size. Any thoughts on the new size? I was thinking 1024, to bring it roughly in line with the forward pass payload?

Sep 13 '22 10:09 joostjager

1024 sounds good to me since we don't have a real use-case yet, as long as all implementations change their behavior to handle potentially longer messages in the future (this way we can change the spec recommendation later if a use-case comes up that requires a bigger length without having to update intermediate nodes).

Sep 13 '22 11:09 t-bast

So just to get it right, you're saying we don't change the recommendation yet? If that's indeed the intention, we have to be aware that a later increase allows for fingerprinting of the destination node. Perhaps this needs an activation date... All implementations switch to 1024 byte failure messages on Jan 1, 2023 :)

Sep 13 '22 12:09 joostjager

So just to get it right, you're saying we don't change the recommendation yet?

No, I think it's a good idea to change the recommendation to something bigger, and 1024 sounds good. I'm just saying we can potentially go much bigger (we're only restricted by the whole message being 65kB), but since we don't know yet how we're going to use it, a first step to 1024 is probably a good idea.

But we should also change the spec to say that forwarding nodes should be able to handle messages with a different length than the recommended one, to give us freedom to change it in the future without requiring a network-wide upgrade. Does that make sense?

Sep 13 '22 12:09 t-bast

Yes. Will also check to see what lnd does and does not support currently.

Sep 13 '22 12:09 joostjager

It seems that lnd forwards the failure message fine. The only problem is that the payment sender isn't able to handle any other length than 256. So if we'd add a long but optional tlv extension to the failure message today, senders running lnd would interpret that as an unreadable failure message with all the undesired consequences for reputation tracking.

The fix https://github.com/lightningnetwork/lnd/pull/6913 is small, but critical to unblock the usage of longer tlv failures.

Sep 13 '22 14:09 joostjager

Updated PR with 1024 byte tlv failure test vector.

For my own reference: patch used to generate vector https://github.com/lightningnetwork/lnd/compare/master...bottlepay:lnd:failure-message-test-vector and https://github.com/lightningnetwork/lightning-onion/compare/master...joostjager:lightning-onion:failure-message-test-vector

Sep 14 '22 07:09 joostjager

Outstanding issue for this PR is signaling the capabilities of the payment sender to interpret long failure messages.

Alternatively this PR can be split in two parts:

Add failure message TLV
Change recommended failure length to 1024.

Oct 28 '22 20:10 joostjager

Outstanding issue for this PR is signaling the capabilities of the payment sender to interpret long failure messages.

If senders follow the spec already (I know they don't) then this wouldn't be needed?

Oct 31 '22 11:10 halseth

Yes, that's correct. We wouldn't need it.

Oct 31 '22 11:10 joostjager

As mentioned on the call, I'm a bit dubious of the need for signaling today - it materially hurts privacy across the network, so unless we have a need to return long errors in the next year or two I don't think it's worth it. Causing clients to fail to parse errors by making errors longer isn't going to break those clients, only cause them to have no failure reason, which sucks but if you don't upgrade in two years that seems like a reasonable level of breakage.

Oct 31 '22 16:10 TheBlueMatt

As mentioned on the call, I'm a bit dubious of the need for signaling today - it materially hurts privacy across the network, so unless we have a need to return long errors in the next year or two I don't think it's worth it. Causing clients to fail to parse errors by making errors longer isn't going to break those clients, only cause them to have no failure reason, which sucks but if you don't upgrade in two years that seems like a reasonable level of breakage.

It's not just no failure reason, but those clients also lose the information about which node is the error source. See fat errors on the ML for a description of how the various implementations handle this.

Oct 31 '22 17:10 joostjager

Right, that's what I meant by "no failure reason", sorry it was unclear. My point is that if you don't upgrade for two years suffering from degraded route selection performance seems like a perfectly fine outcome.

Oct 31 '22 19:10 TheBlueMatt

Maybe inbound routing fees become a success, and then we need to wait for two years before we can return a complete fee_insufficient failure that exceeds 256 bytes. Or it does not become a success, exactly because the required failure cannot be returned. No fee update failure isn't ideal given the quirks that we sometimes see with gossip propagation.

Of course everyone can have their own opinion about the viability of protocol extensions, the value of experimentation and whether that's worth the trade-off with privacy.

Oct 31 '22 20:10 joostjager

I'm kinda confused why a fee_insufficient failure is > 256 bytes? There's a ton of headroom above existing error messages to play with.

Oct 31 '22 23:10 TheBlueMatt

If the fee is insufficient, it could be that the sender either used an outdated inbound fee policy, an outdated outbound fee policy or both. The routing node only sees a total fee paid, and doesn't know which one of the cases applies.

To cover all cases and ensure that the next attempt is successful, the routing node would need to return both the incoming channel update and the outgoing channel update with the fee_insufficient failure. 256 bytes is not enough for two channel updates.

Of course signed fee information can be returned in a more compact way, but that would deviate from the standard channel update format. I expect that to lead to lots of extra work and maintenance. The graph database for example would also need to support this new compact format.

Nov 01 '22 08:11 joostjager

Of course signed fee information can be returned in a more compact way, but that would deviate from the standard channel update format. I expect that to lead to lots of extra work and maintenance. The graph database for example would also need to support this new compact format.

That sounds like the way forward if you want to do experimentation though? You really don't need to return complete channel updates, and returning only the values you're interested in lets you experiment today, without changes from any implementation nor the spec required.

Nov 02 '22 07:11 t-bast

Yes, could do that. But the quicker way to go forward with the experiment is to just signal using a custom onion payload tlv record. Long failures are already relayed back, so no implementation changes required. And for the rest it's just between sender and routing node.

I thought that it would be nice to make that signaling official if we see a future need for long failure messages anyway. But if not, I'll just go ahead and do my own thing.

In that case, I can remove the increase of the recommended failure length to 1024 from this PR and note that longer failure messages are a risk because senders may not be able to parse.

Nov 02 '22 10:11 joostjager

Removed increase of the recommended message length to 1024 from PR and added comment about old nodes not being able to interpret long messages.

Nov 21 '22 11:11 joostjager

@joostjager do you have a branch of lnd where I can force a node to emit an error bigger than 256 bytes that contains dummy tlvs to test cross-compatibility?

Dec 05 '22 07:12 t-bast

If you run an lnd routing node with https://github.com/lightningnetwork/lnd/pull/6967 and the sender does not include enough fee, you will get back a long failure message with two channel updates. The sender does need to signal long failure compatibility through a tlv record though.

If this is all a bit too much for the purpose of interop testing, I can prep a simpler branch with a long failure for you.

Dec 05 '22 10:12 joostjager

If this is all a bit too much for the purpose of interop testing, I can prep a simpler branch with a long failure for you.

Don't worry, this should work well enough for my tests. Can you simply update the description of https://github.com/lightningnetwork/lnd/pull/6967 with the details of the tlv that needs to be set in the onion to opt-in to receive the long failure?

Dec 05 '22 10:12 t-bast

Description updated.

Dec 05 '22 11:12 joostjager