Fix background crashes
Summary
We have mechanisms in place to close NostrDB streams when the database needs to close; however, there is a short time window where those streams are closing down but the database still has its "open" property set to true, which means that new NostrDB streams may open. If that happens, those new streams will still be active when NostrDB gets closed down, causing memory crashes.
This was found by inspecting several crash logs and noticing that:
- most of the `ndb.close` calls come from the general backgrounding task (not the last-resort backgrounding task), where all old tasks are guaranteed to have closed (we wait for all of them to close before proceeding to close NostrDB).
- the stack traces of the crashed threads show that, in most cases, the streams crash while they are in the query stage (which means they must have been very recently opened).
The issue was fixed by signalling that NostrDB has closed (without actually closing it) before cancelling any streaming tasks and officially closing NostrDB. This way, new NostrDB streaming tasks will notice that the database is closed and will wait for it to reopen.
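To make the ordering concrete, here is a minimal, hypothetical sketch of the sequence described above; MockNdb, markClosed(), and backgroundAndClose are placeholder names standing in for the actual Damus types and APIs:

```swift
import Foundation

// Hypothetical sketch of the fix's ordering: signal "closed" first, drain the existing
// streams, and only then tear down the real database. MockNdb is a stand-in, not Damus code.
final class MockNdb {
    private let lock = NSLock()
    private var isClosed = false

    // Streams consult this before opening; once it is false they wait for a reopen instead.
    var isOpen: Bool {
        lock.lock(); defer { lock.unlock() }
        return !isClosed
    }

    // Phase 1: signal "closed" without tearing anything down, so no new stream starts.
    func markClosed() {
        lock.lock(); defer { lock.unlock() }
        isClosed = true
    }

    // Phase 2: actually close the underlying database once all streams have stopped.
    func close() {
        // ... free the underlying NostrDB/LMDB handle here ...
    }
}

func backgroundAndClose(ndb: MockNdb, streamTasks: [Task<Void, Never>]) async {
    ndb.markClosed()                              // 1. new streams now see the DB as closed
    streamTasks.forEach { $0.cancel() }           // 2. cancel streams that are already running...
    for task in streamTasks { await task.value }  //    ...and wait for them to finish
    ndb.close()                                   // 3. only now tear down the real handle
}
```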
No changelog entry is needed as this issue was introduced after our last public release.
Changelog-None
Closes: https://github.com/damus-io/damus/issues/3245
Checklist
Standard PR Checklist
- [x] I have read (or I am familiar with) the Contribution Guidelines
- [ ] I have tested the changes in this PR
- [x] I have profiled the changes to ensure there are no performance regressions, or I do not need to profile the changes.
- Utilize Xcode profiler to measure performance impact of code changes. See https://developer.apple.com/videos/play/wwdc2025/306
- If not needed, provide reason: Minimal changes
- [x] I have opened or referred to an existing github issue related to this change.
- [x] My PR is either small, or I have split it into smaller logical commits that are easier to review
- [x] I have added the signoff line to all my commits. See Signing off your work
- [x] I have added appropriate changelog entries for the changes in this PR. See Adding changelog entries
- [x] I do not need to add a changelog entry. Reason: Issue was introduced after our last public release.
- [x] I have added appropriate `Closes:` or `Fixes:` tags in the commit messages wherever applicable, or made sure those are not needed. See Submitting patches
Test report
TBD, I will run this for a few days on my phone to see if the background crash occurs.
Results:
- [ ] PASS
- [ ] Partial PASS
- Details: [Please provide details of the partial pass]
Other notes
[Please provide any other information that you think is relevant to this PR.]
@jb55, I changed my mind about this approach, and converted this PR back to a draft. Please do not merge it yet.
I started trying to recreate the crash more consistently via an automated test so that we can prevent future regressions and to make sure this fix is rigorous and thorough.
Even though I am confident that I understand the root cause, recreating this in a controlled environment has proven to be quite difficult (the time window for the crash to occur is very small and requires many things to happen in a particular sequence).
Moreover, after spending some time trying to create automated tests to recreate this, I realized that perhaps we should fix this at the Ndb.swift or nostrdb.c level — considering that we want NostrDB to become a library. Otherwise other users of the NostrDB Swift library can end up running into this.
I am now starting to investigate and brainstorm how I can fix it at that level.
@jb55, so far I was able to bring the repro rate down from about 4 crashes out of 7 test runs (4/7) to 1 crash out of 16 test runs (1/16) with the fixes in this PR.
Note: Each "test run" runs about 10K iterations of simultaneous queries and NostrDB close/open calls.
Unfortunately I still see occasional edge cases where it still crashes with a memory error. Before I move on with the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?
Does this review pass your laugh test, @danieldaquino?
Given the goal—preventing new streams from sneaking in during the short “closing” window—I’d still address the earlier findings before evaluating the overall design:
- markClosed() currently flips is_closed to true, but close() immediately returns when is_closed is true. As written, your proposed “signal closed before actually closing” flow just neuters the real close and leaves the LMDB handle open. If you want two phases (soft-close vs. teardown), split the state (e.g., is_acceptingWork vs. is_destroyed) instead of reusing closed for both.
- The force‑try around waitUntilNdbCanClose means any long-running reader (>2 s) still crashes the app, so the new guard rail doesn’t actually prevent the crash you’re chasing.
- The lock must be marked open for every initializer (including Ndb(ndb:) / Ndb.empty) or readers will time out immediately, making it impossible to exercise the race you’re trying to mitigate.
- The “10K iterations” stress test currently runs two unsynchronized loops and never awaits the detached task. It keeps mutating state beyond the test lifetime and captures non-Sendable objects in @Sendable closures, so it is not a reliable regression signal yet (a corrected harness is sketched after this list).
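For reference, a corrected harness could look roughly like the sketch below: every iteration awaits all of its concurrent work inside a task group, so nothing outlives the test and only Sendable values cross task boundaries. NdbStub and the query/close/open calls are placeholders, not the real test code:

```swift
import XCTest

// Minimal actor stand-in so the sketch compiles; a real test would exercise the actual
// Ndb wrapper (and the C layer underneath) instead.
actor NdbStub {
    private var isOpen = true
    func runQuery() { /* would only query while open */ }
    func close() { isOpen = false }
    func open() { isOpen = true }
}

final class NdbCloseRaceStressTests: XCTestCase {
    func testConcurrentQueriesAgainstCloseOpen() async {
        let ndb = NdbStub()
        for _ in 0..<10_000 {
            await withTaskGroup(of: Void.self) { group in
                // Several concurrent readers...
                for _ in 0..<4 {
                    group.addTask { await ndb.runQuery() }
                }
                // ...racing against a close/open cycle.
                group.addTask {
                    await ndb.close()
                    await ndb.open()
                }
                // withTaskGroup awaits every child task before returning,
                // so no work leaks past this iteration.
            }
        }
    }
}
```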
Once those are fixed, consider these design tweaks:
- Model the lifecycle explicitly: e.g., enum State { acceptingWork, draining, destroyed } protected by the mutex. Readers check state == .acceptingWork, backgrounding switches to .draining, and close() only tears down once state reaches .destroyed (a rough sketch follows this list).
- Replace the try!/timeouts with a two-phase approach: a short wait to let active readers drain, followed by an unprotected close if the app is about to be suspended. That keeps background timeouts from turning into fatal crashes.
- Push the “reopen/close coordination” into Ndb itself: expose an async API that handles the drain/close/reopen sequence so NostrNetworkManager doesn’t have to juggle semaphore semantics.
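To make the first two suggestions concrete, here is a rough, illustrative sketch of an explicit lifecycle state combined with a two-phase (drain, then teardown) close; NdbLifecycle and everything in it are hypothetical names, not existing Damus code:

```swift
import Foundation

// Illustrative only: a lifecycle state protected by a condition lock, plus a two-phase
// close that briefly drains active readers instead of force-trying and crashing.
final class NdbLifecycle {
    enum State { case acceptingWork, draining, destroyed }

    private let mutex = NSCondition()
    private var state: State = .acceptingWork
    private var activeReaders = 0

    // Readers only proceed while we are accepting work.
    func beginRead() -> Bool {
        mutex.lock(); defer { mutex.unlock() }
        guard state == .acceptingWork else { return false }
        activeReaders += 1
        return true
    }

    func endRead() {
        mutex.lock(); defer { mutex.unlock() }
        activeReaders -= 1
        mutex.broadcast()
    }

    // Phase 1: stop accepting new readers and wait (bounded) for active ones to drain.
    // Phase 2: tear down regardless, since the app is about to be suspended anyway;
    // a timeout here degrades to an unprotected close rather than a fatal crash.
    func drainAndClose(timeout: TimeInterval, teardown: () -> Void) {
        mutex.lock()
        state = .draining
        let deadline = Date().addingTimeInterval(timeout)
        while activeReaders > 0 && mutex.wait(until: deadline) {}
        state = .destroyed
        mutex.unlock()
        teardown() // close the LMDB handle; no reader can have started after .draining
    }
}
```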
On Mon, Nov 17, 2025 at 06:21:12PM -0800, Daniel D’Aquino wrote:
danieldaquino left a comment (damus-io/damus#3313) Unfortunately I still see occasional edge cases where it still crashes with a memory error. Before I move on with the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?
sorry I have been travelling, I'll try to take a look soon on my flight
@jb55, quick update, the remaining crash I am seeing during the test may be a different crash with a separate cause (despite the fact that it looks similar). I will try to reproduce the remaining crash with the address sanitizer turned ON to get more details.
In any case, whenever you have a chance please let me know your thoughts on the approach taken!
By analyzing the crash with Address Sanitizer enabled, I was able to find the cause of the remaining crashes arising in roughly 1 out of 16 test runs (i.e. 1 out of 160K cycles of concurrent queries and ndb open/close operations).
Those were TOCTOU race conditions around those ndb users (query and NdbTxn.deinit). Rough sequence:
- Thread A: Function checks if ndb is open and whether the current ndb generation number matches its own. It does, so the function continues.
- Thread B: `Ndb.close()` acquires the semaphore lock to close NostrDB, and closes it.
- Thread A: Tries to acquire the semaphore lock to keep NostrDB open. Function waits to acquire the lock.
- Thread B: Quickly reopens NostrDB (a new generation), freeing the semaphore lock for Ndb users.
- Thread A: Acquires semaphore lock to keep NostrDB open before timing out. Function proceeds with its operations.
- Thread A: Function does not check that the NostrDB generation has changed, and uses ndb C objects from the previous generation that have already been deallocated, causing a use-after-free heap memory error and crashing.
I have addressed this issue by adding those checks inside the keepNdbOpen closures, where the semaphore lock can guarantee NostrDB will not restart under it.
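For illustration, the shape of that re-check looks roughly like the sketch below; MockGenerationalNdb, keepNdbOpen, and safeQuery are stand-ins for the real implementation, which may differ:

```swift
import Foundation

// Illustrative sketch only: the generation re-check performed *inside* the lock-holding
// closure, as described above. All names and signatures here are placeholders, not the
// real Ndb API.
final class MockGenerationalNdb {
    private let semaphore = DispatchSemaphore(value: 1)
    private(set) var generation = 0

    // Assumed behavior of `keepNdbOpen`: hold the lock so a close/reopen cannot happen
    // while the closure runs, and hand the closure the generation that is current *now*.
    // (The real implementation would also bail out if the database is closed.)
    func keepNdbOpen<T>(_ body: (Int) -> T?) -> T? {
        guard semaphore.wait(timeout: .now() + 2) == .success else { return nil }
        defer { semaphore.signal() }
        return body(generation)
    }

    // Models a close followed by a quick reopen: the generation is bumped, and any
    // C objects belonging to the previous generation are no longer valid.
    func closeAndReopen() {
        semaphore.wait()
        defer { semaphore.signal() }
        generation += 1
    }
}

// A reader that captured `expectedGeneration` before waiting on the lock must re-check it
// once it is inside the critical section; otherwise it could dereference freed memory.
func safeQuery(ndb: MockGenerationalNdb, expectedGeneration: Int) -> String? {
    return ndb.keepNdbOpen { (currentGeneration: Int) -> String? in
        guard currentGeneration == expectedGeneration else { return nil }
        return "result for generation \(currentGeneration)"
    }
}
```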
I reran the test with the new fix for 20 iterations (i.e. about 200K cycles of concurrent queries and ndb open/close operations), and did not run into any crashes this time. Damus: 17e2721d441487a4f6d34b835820beae7ac49294
I am not sure I am 100% happy with my design yet though, perhaps it is not elegant enough, or maybe it's just the fact that I am only using it in a few high-risk functions and not globally. I think I need to sleep on it. @jb55, please let me know if you have any comments on the design of the syncing mechanism and its application.
I included this on an experimental TestFlight build: 1.16 (1215) for some longer-term-use testing.