
Unable to pay extreme multipart payment in tests

Open Lagrang3 opened this issue 1 year ago • 3 comments

As a follow-up to issue #7563, I have tested renepay on the same topology that @daywalker90 proposed for testing getroutes. It turned out that the payment failed with "failed to find a feasible flow", due to a sequence of HTLC failures on a remote channel that should have had enough liquidity.

To reproduce the problem I tried the following test using only sendpay. The payment flow is l1 -> l2 -> l3, where more than one channel connects l1->l2 and more than one connects l2->l3.

I try to send a 4-part payment with the following routes:

  • part 1 with 350k in the route l1->l2->l3 over a couple of channels with capacity 400k,
  • part 2 with 250k in the route l1->l2->l3 over a couple of channels with capacity 300k,
  • part 3 with 150k in the route l1->l2->l3 over a couple of channels with capacity 200k,
  • part 4 with 50k in the route l1->l2->l3 over a couple of channels with capacity 100k.

One or more payment parts always fail at the l2->l3 hop, with CHANNEL_ERR_CHANNEL_CAPACITY_EXCEEDED showing up in l2's logs.

Another simpler case that fails as well (a quick capacity sanity check follows the list):

  • part 1 with 200k in the route l1->l2->l3 over a couple of channels with capacity 400k,
  • part 2 with 200k in the route l1->l2->l3 over a couple of channels with capacity 300k.
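
For intuition about CHANNEL_ERR_CHANNEL_CAPACITY_EXCEEDED: the two parts together do not fit into any single channel once the reserve is taken out, so this failure only makes sense if both HTLCs end up on the same channel. A rough sanity check (the 1% reserve figure is an assumption and commitment fees are ignored):

# Rough capacity check; the reserve value and fee handling are assumptions.
from pyln.client import Millisatoshi

capacity = Millisatoshi("400000sat")
reserve = Millisatoshi("4000sat")        # assuming the default 1% reserve
spendable = capacity - reserve           # ignoring commitment fees

part1 = part2 = Millisatoshi("200000sat")
assert part1 <= spendable                # one part alone fits
assert part1 + part2 > spendable         # both parts on one channel do not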

If instead I try making a single-part payment:

  • 350k in the route l1->l2->l3 over a couple of channels with capacity 400k,

the attempt succeeds.

Here is the test I used to reproduce the problem:

from pyln.client import Millisatoshi
from pyln.testing.utils import only_one, sync_blockheight, wait_for


def get_local_channel_by_id(node, chanid):
    """Return node's local channel whose channel_id matches chanid, or None."""
    for c in node.rpc.listpeerchannels()['channels']:
        if c['channel_id'] == chanid:
            return c
    return None

def start_channels(connections):
    nodes = list()
    for src, dst, fundamount in connections:
        nodes.append(src)
        nodes.append(dst)
        src.rpc.connect(dst.info["id"], "localhost", dst.port)

    bitcoind = nodes[0].bitcoin
    # If we got here, we want to fund channels
    for src, dst, fundamount in connections:
        addr = src.rpc.newaddr()["bech32"]
        bitcoind.rpc.sendtoaddress(addr, (fundamount + 1000000) / 10**8)

    bitcoind.generate_block(1)
    sync_blockheight(bitcoind, nodes)
    txids = []
    chan_ids = []
    for src, dst, fundamount in connections:
        reply = src.rpc.fundchannel(dst.info["id"], fundamount, announce=True, minconf=0)
        txids.append(reply["txid"])
        chan_ids.append(reply["channel_id"])

    # Confirm all channels and wait for them to become usable
    bitcoind.generate_block(1, wait_for_mempool=txids)
    scids = []
    for con, mychan_id in zip(connections, chan_ids):
        src = con[0]
        wait_for(lambda: get_local_channel_by_id(src, mychan_id)['state'] == "CHANNELD_NORMAL")
        scids.append(get_local_channel_by_id(src, mychan_id)['short_channel_id'])

    # Make sure they have all seen block so they don't complain about
    # the coming gossip messages
    sync_blockheight(bitcoind, nodes)
    bitcoind.generate_block(5)

    # Make sure everyone sees all channels, all other nodes
    for n in nodes:
        for scid in scids:
            n.wait_channel_active(scid)

    # Make sure we have all node announcements, too
    for n in nodes:
        for n2 in nodes:
            wait_for(
                lambda: "alias" in only_one(n.rpc.listnodes(n2.info["id"])["nodes"])
            )
    return chan_ids

def test_channel_reserve(node_factory):
    def direction(node1, node2):
        return 0 if node1.info['id'] < node2.info['id'] else 1
    
    opts = {"disable-mpp": None, "fee-base": 0, "fee-per-satoshi": 0, "cltv-delta": 6}
    l1, l2, l3 = node_factory.get_nodes(3, opts=opts)
    
    chan_ids = start_channels(
        [
            (l1, l2, 400_000),
            (l2, l3, 400_000),
            (l1, l2, 300_000),
            (l2, l3, 300_000),
            (l1, l2, 200_000),
            (l2, l3, 200_000),
            (l1, l2, 100_000),
            (l2, l3, 100_000),
        ]
    )
    
    # we should be able to send these four parts:
    # nparts = 4
    # route_amounts = ["350000sat", "250000sat", "150000sat", "50000sat"]
    # instead the test fails even with these:
    nparts = 2
    route_amounts = ["201000sat", "200000sat", "150000sat", "50000sat"]
    total_msat = sum([Millisatoshi(a) for a in route_amounts[:nparts]])
    
    # Test succeeds if we are able to pay this invoice
    inv = l3.rpc.call("invoice", {"amount_msat": total_msat,
                                  "label": "inv", "description": "inv", "cltv": 10})
    inv_decode = l1.rpc.decode(inv["bolt11"])
    
    # Data shared by every route we will construct: l1->l2->l3
    route = [{"id": l2.info["id"],
              "direction": direction(l1, l2),
              "delay": 16,
              "style": "tlv"},
             {"id": l3.info["id"],
              "direction": direction(l2, l3),
              "delay": 10,
              "style": "tlv"}]

    # Send every part with sendpay
    for part in range(nparts):
        this_part_msat = Millisatoshi(route_amounts[part])
        chan1 = get_local_channel_by_id(l1, chan_ids[part*2])
        chan2 = get_local_channel_by_id(l2, chan_ids[part*2+1])
        
        route[0]["channel"] = chan1["short_channel_id"]
        route[1]["channel"] = chan2["short_channel_id"]
        route[0]["amount_msat"] = route[1]["amount_msat"] = this_part_msat
        
        assert chan1["spendable_msat"] >= this_part_msat
        assert chan2["spendable_msat"] >= this_part_msat
        
        l1.rpc.call(
            "sendpay",
            {"route": route,
             "payment_hash": inv["payment_hash"],
             "payment_secret": inv["payment_secret"],
             "amount_msat": total_msat,
             "groupid": 1,
             "partid": part+1,},
        )
    l1.wait_for_htlcs()
    
    # Are we happy?
    waitsendpay_response = l1.rpc.call("waitsendpay",{"payment_hash":
        inv["payment_hash"], "partid": 1, "groupid": 1})
    receipt = only_one(l3.rpc.listinvoices("inv")["invoices"])
    assert receipt["status"] == "paid"
    assert receipt["amount_received_msat"] == total_msat

Lagrang3 · Aug 23 '24 10:08

I think the problem is with the parallel channels. On l2, channeld only sees one channel with its peer l1.

I can see that lightningd's peer structure has a list of channels (defined in lightningd/peer_control.h), but it seems that channeld's peer structure has only one channel (defined in channeld/channeld.c). I'll keep looking...

Update: there's one channeld instance for every channel. So maybe lightningd is sending the HTLC request to the wrong subdaemon?

As a matter of fact, in the logs:

035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: Adding HTLC 0 amount=501000000msat cltv=119 gave CHANNEL_ERR_ADD_OK
...
035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: Adding HTLC 1 amount=500000000msat cltv=119 gave CHANNEL_ERR_CHANNEL_CAPACITY_EXCEEDED

Both HTLC requests go to the same channeld-chan#2; one should have gone to channeld-chan#2 and the other to channeld-chan#4.
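
One way to confirm this from the test itself, instead of grepping the logs, is to ask l2 which of its channels towards l3 actually hold in-flight HTLCs. A sketch (field names as returned by listpeerchannels; the printout is just for eyeballing):

# Sketch: after the sendpay calls, list which of l2's channels towards l3
# carry in-flight HTLCs.  With the suspected selection bug, they all pile
# up on the same short_channel_id.
for c in l2.rpc.listpeerchannels(l3.info["id"])["channels"]:
    htlc_amounts = [h["amount_msat"] for h in c.get("htlcs", [])]
    print(c.get("short_channel_id"), htlc_amounts)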

Lagrang3 · Aug 23 '24 14:08

There is a channel selection step in lightningd before forwarding: https://github.com/ElementsProject/lightning/blob/9d88ce3b592ca42a01104758313b9b2806d40230/lightningd/peer_htlcs.c#L1215

Is that a good idea? Why wouldn't we try to forward right on the channel we have been requested to use?

Clearly in this case that function is not selecting the best channel.
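
To make the failure mode concrete, here is an illustration in Python of the kind of per-peer selection that would produce exactly these logs. This is only a hypothetical sketch, not the actual logic in peer_htlcs.c: if the selector ranks channels by a static property and ignores HTLCs it has already committed for earlier parts, every forward lands on the same channel even when a parallel channel has room.

# Illustration only: a hypothetical selector, not CLN's real code.
def select_forward_channel(channels, amount_msat):
    # A static filter on capacity looks sufficient, but it does not account
    # for HTLCs already added by earlier parts of the same payment...
    usable = [c for c in channels
              if c["state"] == "CHANNELD_NORMAL"
              and c["total_msat"] >= amount_msat]
    # ...and picking the "biggest" candidate always returns the same channel,
    # so parallel channels never get used.
    return max(usable, key=lambda c: c["total_msat"], default=None)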

Lagrang3 · Aug 24 '24 08:08

Though non-strict forwarding is allowed by the spec: https://github.com/lightning/bolts/blob/master/04-onion-routing.md#non-strict-forwarding

Lagrang3 · Aug 24 '24 13:08