rust-lightning icon indicating copy to clipboard operation
rust-lightning copied to clipboard

Disallow dual-sync-async persistence without restarting

Open TheBlueMatt opened this issue 8 months ago • 12 comments

In general, we don't expect users to persist
`ChannelMonitor[Update]`s both synchronously and asynchronously for
a single `ChannelManager` instance. If a user has implemented
asynchronous persistence, they should generally always use that,
as there is then no advantage to them to occasionally persist
synchronously.

Even still, in 920d96edb6595289902f287419de2d002e2dc2ee we fixed
some bugs related to such operation, and noted that "there isn't
much cost to supporting it". Sadly, this is not true.

Specifically, the dual-sync-async persistence flow is ill-defined
and difficult to define in away that a user can realistically
implement.

Consider the case of a `ChannelMonitorUpdate` which is persisted
asynchronously and while it is still being persisted a new
`ChannelMonitorUpdate` is created. If the second
`ChannelMonitorUpdate` is persisted synchronously, the
`ChannelManager` will be left with a single pending
`ChannelMonitorUpdate` which is not the latest.

If we were to then restart, the latest copy of the `ChannelMonitor`
would be that without any updates, but the `ChannelManager` has a
pending `ChannelMonitorUpdate` for the next update, but not the one
after that. The user would then have to handle the replayed
`ChannelMonitorUpdate` and then find the second
`ChannelMonitorUpdate` on disk and somehow know to replay that one
as well.

Further, we currently have a bug in handling this scenario as we'll
complete all pending post-update actions when the second
`ChannelMonitorUpdate` gets persisted synchronously, even though
the first `ChannelMonitorUpdate` is still pending. While we could
rather trivially fix these issues, addressing the larger API
question above is difficult and as we don't anticipate this
use-case being important, we just disable it here.

Note that we continue to support it internally as some 39 tests
rely on it.

Issue highlighted by (changes to the) chanmon_consistency fuzz
target (in the next commit).

TheBlueMatt avatar Apr 15 '25 15:04 TheBlueMatt

👋 Thanks for assigning @valentinewallace as a reviewer! I'll wait for their review and will help manage the review process. Once they submit their review, I'll check if a second reviewer would be helpful.

ldk-reviews-bot avatar Apr 15 '25 15:04 ldk-reviews-bot

We may be able to remove the test test_sync_async_persist_doesnt_hang since it specifically covers async + sync persistence.

valentinewallace avatar Apr 15 '25 18:04 valentinewallace

Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

valentinewallace avatar Apr 16 '25 17:04 valentinewallace

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.26%. Comparing base (83e9e80) to head (cb6219d). Report is 60 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3737      +/-   ##
==========================================
+ Coverage   89.12%   90.26%   +1.13%     
==========================================
  Files         156      158       +2     
  Lines      123514   134266   +10752     
  Branches   123514   134266   +10752     
==========================================
+ Hits       110086   121195   +11109     
+ Misses      10749    10536     -213     
+ Partials     2679     2535     -144     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Apr 16 '25 20:04 codecov[bot]

Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

Yea, I imagine its a good bit of work, though.

TheBlueMatt avatar Apr 16 '25 20:04 TheBlueMatt

Needs a squash

wpaulino avatar Apr 28 '25 17:04 wpaulino

Squashed with a minor comment fix:

$ git diff-tree -U1 ddccbdabd d009b3918
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index c86eed77b..54b607d88 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2573,4 +2573,4 @@ where
 	/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
-	/// otherwise directly enforce this, we enforce it in debug builds here by storing which one is
-	/// in use.
+	/// otherwise directly enforce this, we enforce it in non-test builds here by storing which one
+	/// is in use.
 	#[cfg(not(test))]

TheBlueMatt avatar Apr 29 '25 13:04 TheBlueMatt

These externalized tests seem to be failing:

- test_batch_funding_close_after_funding_signed
- test_batch_channel_open

wpaulino avatar Apr 29 '25 17:04 wpaulino

Oof, ugh, right. Squashed-pushed a fix:

$ git diff-tree -U1 d009b3918 cb6219df6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 54b607d88..34751bfc9 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2575,3 +2575,3 @@ where
 	/// is in use.
-	#[cfg(not(test))]
+	#[cfg(not(any(test, feature = "_externalize_tests")))]
 	monitor_update_type: AtomicUsize,
@@ -3321,3 +3321,3 @@ macro_rules! handle_new_monitor_update {
 			ChannelMonitorUpdateStatus::InProgress => {
-				#[cfg(not(test))]
+				#[cfg(not(any(test, feature = "_externalize_tests")))]
 				if $self.monitor_update_type.swap(1, Ordering::Relaxed) == 2 {
@@ -3330,3 +3330,3 @@ macro_rules! handle_new_monitor_update {
 			ChannelMonitorUpdateStatus::Completed => {
-				#[cfg(not(test))]
+				#[cfg(not(any(test, feature = "_externalize_tests")))]
 				if $self.monitor_update_type.swap(2, Ordering::Relaxed) == 1 {
@@ -3594,3 +3594,3 @@ where

-			#[cfg(not(test))]
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
 			monitor_update_type: AtomicUsize::new(0),
@@ -14767,3 +14767,3 @@ where

-			#[cfg(not(test))]
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
 			monitor_update_type: AtomicUsize::new(0),

TheBlueMatt avatar Apr 30 '25 01:04 TheBlueMatt

Wondering whether the problem where chan mgr will be left with a single pending update which is not the latest can also occur when purely async is used, but channel_monitor_updated are called in the reverse order?

MonitorEvent::Completed is only allowed to signal "all updates to and through the given id have completed" to avoid this case. ChainMonitor tracks this and generates the right MonitorEvetns.

TheBlueMatt avatar Apr 30 '25 15:04 TheBlueMatt

LGTM after squash

valentinewallace avatar May 01 '25 19:05 valentinewallace

Re-dropped the constant after discussion.

TheBlueMatt avatar May 02 '25 00:05 TheBlueMatt

Okay, I believe all feedback has been addressed.

TheBlueMatt avatar May 05 '25 17:05 TheBlueMatt

Squashed without further changes.

TheBlueMatt avatar May 05 '25 18:05 TheBlueMatt