lnd icon indicating copy to clipboard operation
lnd copied to clipboard

multi: prevent duplicates for locally dispatched HTLC attempts

Open calvinrzachman opened this issue 6 months ago • 12 comments

Change Description

As mentioned in https://github.com/lightningnetwork/lnd/pull/9489, we intend to offload path-finding, onion construction and payment life-cycle management to an external entity (with a remotely instantiated ChannelRouter type) capable of orchestrating payments across a set of remote lnd instances. The lnd instances would be relegated to accept onion payments for direct delivery to the network. Ideally lnd's existing types could be made general enough to work in both this and more traditional deployment scenarios.

We'd like to prevent duplicate payment attempts and unintentional loss of funds when using the forthcoming switchrpc gRPC sub-server and SendOnion/TrackOnion RPCs to dispatch and track payments https://github.com/lightningnetwork/lnd/pull/9489.

One idea is to make SendHTLC duplicate safe for given attempt ID so that callers of this function may retry until they receive an explicit and unambiguous acknowledgement from the Switch regarding the status of the request to dispatch an HTLC attempt. This appears un-interesting in the default case where ChannelRouter and Switch run together in a single lnd binary, but becomes more important for correctness when the Router and Switch run in separate processes with synchronous communication via RPC - as there are different failure domains which must be dealt with.

This PR explores a modification to the Switch's networkResultStore to enable this while hopefully minimizing the necessary ChannelRouter/Switch changes. To accomplish this, we durably persist record of intent to dispatch an HTLC prior to sending that HTLC for forwarding to the broader network.

The SendHTLC() method is updated such that it is safe to call multiple times with the same attempt ID. With this change, we'd now have two writes to disk involved in the outgoing HTLC dispatch flow from the perspective of the Switch. The "routing node" responsibility of forwarding of HTLCs is not impacted.

  1. s.attemptStore.InitAttempt(attemptID): proposed addition. updates attempt result store to checkpoint/durably mark in persistent store some information about the existence of an attempt prior to sending that attempt out to the network.
  2. s.circuits.CommitCircuits(circuit): exists today. tracks flow of HTLC through Switch. Does not prevent duplicates for attempts which have already received a result from the network (settle/fail).

NOTE: The Switch's CircuitMap offers some protection against duplicate attempts with the same ID, but the life-cycle of this protection is not tied to the Router's management of payment life-cycle (eg: DeleteCircuits cleans CircuitMap potentially prior to results being read from the Switch's attempt result store) and appears insufficient for our use case. Once a result arrives and the circuit is torn down, that attempt ID becomes reusable and thus capable of having it's result overwritten within the Switch's store - which can be tricky for remote clients which may go offline for an arbitrary length of time and need to rely on the ability to safely retry an HTLC dispatch request.

It appears as though it is possible to achieve our aim without modifying the on-disk structure of the networkResult type used by this store. One idea is to overload the use of the lnwire.Msg field to allow it to encode a 3rd state besides Settle/Fail - namely Initiated/Pending/In-Flight. This seems to be the minimal required ability to be able to offer the InitAttempt duplicate protection which lasts the entire life-cycle of the attempt, so that the ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.

TODO

  • Modify the ChannelRouter to call SendHTLC defensively when resuming payments after a restart.
    • Local/Monolithic Deployment: the usual case where everything runs in a single binary. With the duplicate safety, this call harmlessly errors and we proceed to tracking the result like normal.
    • Remote Router Deployment: instantiate a type with an alternate RPC backed implementation of SendHTLC which would attempt to deliver the onion via switchrpc SendOnion with a retry loop. The implementation of SendOnion/switch.SendHTLC executed server side would prevent any duplicates for the same ID and allow the client to safely retry until it receives a definitive response from the server. The ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.
// Possibly via an alternate implementation of SendHTLC, but this pattern applies more generally.
// func (p *rpcPaymentDispatcher) SendHTLC(firstHop lnwire.ShortChannelID,
// attemptID uint64, htlcAdd *lnwire.UpdateAddHTLC) error {

       for {
               resp, err := SendOnion(A)
               if err == nil || err == ErrDuplicateAdd {
                       break // HTLC in-flight or just accepted; proceed to track
               }
               continue
       }

       // ... Sometime later possibly from another go routine (eg: Router via
       // GetAttemptResult)
       TrackOnion(A)

Steps to Test

  • Ran the unit tests for htlcswitch package locally and everything passes: make unit-debug log="stdlog trace" pkg=htlcswitch timeout=5m
  • go test -v -timeout 30s -run ^TestNetworkResultStore$ github.com/lightningnetwork/lnd/htlcswitch
  • More tests needed? ...

calvinrzachman avatar Jul 08 '25 04:07 calvinrzachman

@bitromortac yes, we can certainly expand our switchrpc server to support some method of attempt information cleanup.

Two potential behaviors for deletion I have considered thus far are:

  1. CleanStore(keepSet []ID): deletes everything except that which it's told to keep.
  2. DeleteAttempts(deleteSet []ID): deletes only what is explicitly specified to delete.

There are some differences to the two approaches, but for now we have an implementation of a CleanStore RPC which follows the approach lnd has taken historically for deletion from the Switch's attempt store here: https://github.com/calvinrzachman/lnd/pull/19

You're quite right that it is not safe to have multiple HTLC dispatching entities independently calling CleanStore without coordination. The CleanStore approach to deletion requires ID space separation. It cannot work with multiple clients sharing the same flat attempt ID space. There needs to be an identity-aware mapping of attempt IDs - the Switch must know:

  • Which client (or namespace) "owns" each attempt ID.
  • Therefore, it can safely say: “delete all of my IDs except the ones I still care about.”

Though I don't have this entirely flushed out, I think it might be possible for multiple routers to share an ID space if:

  1. The Switch RPC server centrally hands out attempt IDs for all router clients via a NextAttemptID endpoint.
  2. The routers use a DeleteAttempts based approach to deletion.

A minimally working version of this would require a flag to turn off the server's router to prevent its calls to CleanStore, right?

Yes, exactly! We can disable lnd's typical cleanup behavior which occurs on Router startup via routing/htlcswitch configuration or by entirely disabling the routerrpc and the on-board ChannelRouter all together via something like this: https://github.com/lightningnetwork/lnd/issues/10110 or this https://github.com/lightningnetwork/lnd/pull/10178.

calvinrzachman avatar Aug 26 '25 20:08 calvinrzachman

@ellemouton: review reminder @calvinrzachman, remember to re-request review from reviewers when ready

lightninglabs-deploy avatar Nov 17 '25 20:11 lightninglabs-deploy

@gemini-code-assist can you verify if this approach does not violate capsulation prinicples. Why do we want the switch to prevent duplicate payments. In my understanding the service launching the payment should take take of this rather then giving the switch the responsibility. Ideally the switch does not even care about the attempt at all, it just sends out and HTLC and reports its result.

ziggie1984 avatar Nov 25 '25 06:11 ziggie1984

@ziggie1984 That's a very insightful question regarding encapsulation and responsibility, and it touches on a core design consideration for this change.

You're absolutely right that, in a tightly coupled system where the ChannelRouter and Switch run in the same process, the Switch ideally wouldn't need to concern itself with duplicate payment attempts. In such a scenario, the ChannelRouter could manage the idempotency and retry logic entirely.

However, the primary motivation for this change, as outlined in the pull request description, is to support scenarios where the ChannelRouter (the service launching the payment) and the Switch are decoupled and communicate via RPC. In such a distributed environment, several challenges arise:

  1. Network Failures/Client Crashes: If a remote ChannelRouter sends an SendOnion (which calls switch.SendHTLC) request and doesn't receive an immediate, definitive response (e.g., due to a network timeout or a crash on either side), it needs to be able to safely retry that request. Without idempotency at the Switch level, a retry could lead to the same HTLC being dispatched multiple times for the same logical payment attempt, resulting in unintended duplicate payments or loss of funds.
  2. **

gemini-code-assist[bot] avatar Nov 25 '25 06:11 gemini-code-assist[bot]

@gemini-code-assist ok makes sense, but I still don't agree with the approach because it's using the resultStorage for something different than results. Why are we not just using the existiing CircuitMap, which already filters out duplicates ?

ziggie1984 avatar Nov 25 '25 08:11 ziggie1984

@ziggie1984 That's a very fair point and a crucial distinction to make. Let me clarify why the attemptStore (formerly networkResultStore) is being used for this purpose, and why the CircuitMap alone isn't sufficient for the idempotency guarantees we're aiming for in a decoupled architecture.

Why the attemptStore for pending states?

You're right that traditionally, the networkResultStore was primarily for final results (settle/fail). However, in the context of supporting remote ChannelRouter clients and ensuring "at most once" delivery, we need a durable, persistent record of an HTLC attempt's entire lifecycle, not just its final outcome.

  1. Idempotency for Retries: When a remote ChannelRouter sends an SendOnion request to the Switch (which internally calls SendHTLC), it needs to be able to safely retry that request if it doesn't receive an immediate, definitive response (e.g., due to network issues or crashes). By calling InitAttempt on the attemptStore before dispatching the HTLC, the Switch creates a durable record that an attempt with that attemptID has been initiated. If the ChannelRouter retries, the Switch can check the attemptStore, see the pending entry, and return ErrDuplicateAdd, preventing a duplicate HTLC from being sent. This ensures that the SendHTLC operation itself becomes idempotent.
  2. Crash Recovery: If the Switch crashes after InitAttempt but before the HTLC is fully committed to the CircuitMap and dispatched, the attemptStore retains the record of the pending attempt. On restart, the cleanupOrphanedAttempts routine (introduced in this PR) can identify these orphaned pending attempts and fail them, preventing them from hanging indefinitely and ensuring GetAttemptResult calls don't block.

Essentially, the attemptStore is being extended to manage the state of an attempt ID throughout its journey, from initiation to final resolution, providing the necessary durability and idempotency for a distributed system.

Limitations of CircuitMap for this use case

The CircuitMap is indeed excellent at filtering out duplicates for HTLCs that are currently in-flight within the Switch. However, its lifecycle is inherently shorter-lived and more transient than what's required for robust idempotency in a decoupled system:

  • Transient Nature: Once an HTLC is settled or failed, or if the circuit is torn down (e.g., by DeleteCircuits), the corresponding entry is removed from the CircuitMap. This means that if a remote ChannelRouter needs to query the status of an attemptID after the HTLC has completed its forwarding through the Switch, the CircuitMap would no longer hold that information.
  • No Pre-Dispatch Record: The CircuitMap only records HTLCs once they are committed for forwarding. It doesn't provide a mechanism to record the intent to dispatch an HTLC before it's fully processed and committed. This pre-dispatch record is what InitAttempt provides in the attemptStore, which is critical for handling retries from a remote client that might fail before the HTLC even reaches the CircuitMap stage.

In summary, while the CircuitMap handles in-flight duplicates, the attemptStore provides the durable, long-term, and pre-dispatch idempotency guarantees needed for a robust, decoupled ChannelRouter and Switch interaction. It acts as the authoritative source for the idempotent lifecycle of an attemptID.

gemini-code-assist[bot] avatar Nov 25 '25 08:11 gemini-code-assist[bot]

@gemini-code-assist I don't agree with you points why the circuitMap cannot be used.

Regarding the Transient Nature, for this case the dispatcher should first check with GetAttemptResult if this attempt has every received a result. Moreover this duplicate payment sending is only relevant for immediately failures, there is no need to check for duplicates which can happen between long times because the result store is cleaned eventually as well.

Regarding No Pre-Dispatch Record: I don't understand this part, it's totally fine because in that case we want to resend the htlc if it for some reason got not commited the first time.

ziggie1984 avatar Nov 25 '25 08:11 ziggie1984

@ziggie1984 Thanks for pushing back on these points; it's crucial to ensure we're all aligned on the design decisions. Let's break down your concerns further. It seems there might be a slight misunderstanding of the specific problem this change aims to solve, which is primarily about robust idempotency for decoupled ChannelRouter clients interacting with the Switch via RPC, especially in the face of network failures and crashes.

Regarding the "Transient Nature" and GetAttemptResult

You mentioned that the dispatcher could first check GetAttemptResult and that duplicate sending is only relevant for immediate failures, as the result store is eventually cleaned.

My previous point about the CircuitMap's transient nature still holds for the decoupled scenario. The CircuitMap is an in-memory structure that tracks currently in-flight HTLCs. If the Switch process restarts, the CircuitMap is empty. If a remote ChannelRouter client sends an SendOnion request, and the Switch successfully processes InitAttempt but then crashes before the HTLC is committed to the CircuitMap and dispatched, the CircuitMap would have no record of it after restart.

Now, if the remote client retries the SendOnion request (which calls SendHTLC):

  1. Without InitAttempt in attemptStore: The Switch would have no record of the previous attempt. The client would send a new HTLC with the same attemptID, potentially leading to a true duplicate payment being dispatched. This is precisely what we want to prevent.
  2. With InitAttempt in attemptStore (as in this PR): The InitAttempt call creates a durable, persistent record. Even after a Switch crash and restart, the attemptStore retains this pending entry. When the client retries, SendHTLC will call InitAttempt again, find the existing pending entry, and return ErrPaymentIDAlreadyExists (which SendHTLC translates to ErrDuplicateAdd). This tells the client that the attempt was already initiated, preventing a duplicate dispatch. The cleanupOrphanedAttempts routine on startup also ensures that any such pending attempts that never made it to the CircuitMap are eventually failed, preventing them from hanging indefinitely.

So, while GetAttemptResult can tell you if a final result exists, it doesn't provide the necessary pre-dispatch and crash-resilient idempotency guarantee that InitAttempt in the attemptStore offers for remote clients.

Regarding "No Pre-Dispatch Record"

You stated that it's

gemini-code-assist[bot] avatar Nov 25 '25 08:11 gemini-code-assist[bot]

@calvinrzachman have you explored the CircuitMap as ultimate truth ?

ziggie1984 avatar Nov 25 '25 08:11 ziggie1984

I think it would be nice to mention what SendHTLC is currently only safe to run and be called from one Router and more that that (attemptID collision)

ziggie1984 avatar Dec 05 '25 21:12 ziggie1984

Almost, two final comments left:

  1. please remove the pendingMsg check in the new subscribenotifier function in the resultstore
  2. please make sure you add the unit-tests in a separate PR regarding the Duplicate Drops/Deletes I mentioned above.

ziggie1984 avatar Dec 08 '25 11:12 ziggie1984

@ziggie1984 I updated the error name to ErrAttemptResultPending as requested and removed the pending check from the notifySubscribers helper. You're right that we keep the function "pure" notification only and simply just not call this method when it is not appropriate to do so.

We do have unit tests for the CircuitMap which verify the duplicate drops/deletes so we're not completely in uncharted waters, but I can put up a draft branch which adds a test to switch_test.go for this scenario you highlight.

calvinrzachman avatar Dec 10 '25 03:12 calvinrzachman