nostrdb update: note block rendering issue
It seems that on the new nostrdb update branch, the note blocks are sometimes not immediately available. This makes sense, since we have not switched to the local relay model yet.
Let's just make sure we have lazy block parsing working like we did before.
@jb55, after analyzing and testing the nostrdb branch, I am not sure I fully understand what the issue is here.
- I can't reproduce the problem on iOS; notes seem to be rendering as expected in everything I have tried so far.
- From studying the code changes, it seems like we already have lazy block parsing (more on that below).
1. Repro attempt
Damus: 09cfb332b3d715d7663a8bff826c488ba85f875f
Device: iPhone SE simulator
Steps:
- Scroll down on the timeline for about 3 minutes, checking whether note artifacts load properly.
Result: Notes seem to be loading as expected.
2. Code analysis
Damus: 09cfb332b3d715d7663a8bff826c488ba85f875f
NoteContentView will call load when it appears, and when it receives a profile_updated event. load has some loading and planning logic depending on the state, but it seems that it ultimately calls render_note_content to load the note. render_note_content calls ev.blocks(ndb: ndb), which calls lookup_blocks_by_key_with_txn, which in turn calls ndb_get_blocks_by_key. ndb_get_blocks_by_key has this portion of code, which indicates that the blocks are lazily generated if they are not already in the database:
if ((blocks = ndb_lookup_by_key(txn, note_key, NDB_DB_NOTE_BLOCKS, &note_len))) {
return blocks;
}
// If we don't have note blocks, let's lazily generate them. This is
// migration-friendly instead of doing them all at once
if (!(note = ndb_get_note_by_key(txn, note_key, &note_len))) {
// no note found, can't return note blocks
return NULL;
}
if (!(blocks = ndb_note_to_blocks(note)))
return NULL;
So it seems that it should lazily generate the note blocks even if they are not immediately available on the database. Where I see this could fail is if the note itself is not available on the database, or if there is some form of error somewhere during parsing, and the propagation of that error causes ev.blocks(ndb: ndb) to return nil.
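For illustration, the two failure modes could be told apart with a small diagnostic along these lines (a sketch only, not code from the branch; ev.key and ndb.lookup_note_by_key are assumed names for the existing note-key accessor and note lookup):
// Hypothetical diagnostic sketch: narrow down which failure mode was hit
// when the blocks come back nil.
if ev.blocks(ndb: ndb) == nil, let noteKey = ev.key {
    if ndb.lookup_note_by_key(noteKey) == nil {
        // 1. The note itself is not in the database, so blocks cannot be generated
        print("note \(noteKey): missing from nostrdb")
    } else {
        // 2. The note is present, so the lazy parsing path must have failed
        print("note \(noteKey): present in nostrdb, but block parsing returned nil")
    }
}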
Can you please help me understand what the issue is?
Made a very WIP draft (that does not fully work) for one of the possible solutions: https://github.com/damus-io/damus/pull/2963.
Solved some issues related to memory leaks, large CPU usage, and notes getting stuck while rendering.
Some notes are still getting stuck at rendering after scrolling down for a minute. It seems like I am running into some form of deadlock in NostrDB associated with a ndb_wait_for_notes call where all notes suddenly stop loading.
@jb55, are you aware of any potential pitfalls or things that could cause ndb_wait_for_notes to deadlock Ndb altogether?
On Mon, Apr 07, 2025 at 06:44:58PM -0700, Daniel D’Aquino wrote:
danieldaquino left a comment (damus-io/damus#2885)
Solved some issues related to memory leaks, large CPU usage, and notes getting stuck while rendering.
Some notes are still getting stuck at rendering after scrolling down for a minute. It seems like I am running into some form of deadlock in NostrDB associated with a ndb_wait_for_notes call where all notes suddenly stop loading. @jb55, are you aware of any potential pitfalls or things that could cause ndb_wait_for_notes to deadlock Ndb altogether?
I was going to remove wait_for_notes. what are you using it for? you should use the callback mechanism instead to get woken up when notes are received off of subscriptions.
one potential deadlock issue I see is that our ndb_notify_subscriptions calls the subscription callback while holding the monitor lock. This means if any callback code tries to poll for notes or wait for notes, unsubscribe or subscribe, it would deadlock...
do we have a subscription callback registered?
I think the fix for this would be to call monitor->sub_cb outside of the lock at the bottom of ndb_notify_subscriptions
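Roughly, that reordering looks like this (a self-contained toy sketch of the pattern, with made-up names rather than the actual nostrdb structs): collect the subscription ids to wake while holding the lock, release it, and only then invoke the callback.
#include <pthread.h>
#include <stdint.h>

#define MAX_SUBS 32

struct toy_sub { uint64_t subid; int has_new_notes; };
struct toy_monitor {
    pthread_mutex_t lock;
    struct toy_sub subs[MAX_SUBS];
    int num_subs;
    void (*sub_cb)(void *ctx, uint64_t subid);
    void *sub_cb_ctx;
};

static void toy_notify_subscriptions(struct toy_monitor *monitor)
{
    uint64_t woken[MAX_SUBS];
    int i, num_woken = 0;

    // Decide who needs waking while holding the lock...
    pthread_mutex_lock(&monitor->lock);
    for (i = 0; i < monitor->num_subs; i++) {
        if (monitor->subs[i].has_new_notes)
            woken[num_woken++] = monitor->subs[i].subid;
    }
    pthread_mutex_unlock(&monitor->lock);

    // ...but only call back after releasing it, so a callback that polls,
    // subscribes, or unsubscribes cannot deadlock against this lock.
    for (i = 0; i < num_woken; i++) {
        if (monitor->sub_cb)
            monitor->sub_cb(monitor->sub_cb_ctx, woken[i]);
    }
}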
hmm it does not look like we are registering a subscription callback... but yeah we should be using that instead. wait_for_notes is very inefficient as it will block the thread and have massive contention on the monitor lock and can block other threads. in fact, i would go as far as to say wait_for_notes should never be used in a multithreaded context, which is why I was going to remove it.
sub callback is here:
var cfg = ndb_config(flags: 0, ingester_threads: ingest_threads, mapsize: mapsize, filter_context: nil, ingest_filter: nil, sub_cb_ctx: nil, sub_cb: nil)
@jb55, switched over to the subscription callback model and now it works very well!
In brief, this is how I achieved that:
- I have an object that maps subscription IDs to callbacks, suitable for handling the global callback:
+class NdbCallbackCaller {
+ var ndb: ndb_t? = nil
+ var subscriptionCallbackMap: [UInt64: (NoteKey) -> Void] = [:]
+
+ func handleSubscriptionCallback(subId: UInt64) {
+ if let callback = subscriptionCallbackMap[subId] {
+ let result = UnsafeMutablePointer<UInt64>.allocate(capacity: 1)
+ if let ndb {
+ ndb_poll_for_notes(ndb.ndb, subId, result, 1)
+ let noteKey: NoteKey = result.pointee
+ callback(noteKey)
+ }
+ result.deallocate()
+ }
+ }
+}
+
- I created the "trampoline" function in Swift that can be called back from C:
+// This C-exported callback function will be called from the C library when new notes arrive.
+@_cdecl("subscription_callback")
+public func subscription_callback(ctx: UnsafeMutableRawPointer?, subid: UInt64) {
+ guard let ctx = ctx else { return }
+ let holder = Unmanaged<NdbCallbackCaller>.fromOpaque(ctx).takeUnretainedValue()
+ holder.handleSubscriptionCallback(subId: subid)
+}
- I added this subscription mapping/handler object to our Ndb instance and pointed the callback to it in the C library:
- static func open(path: String? = nil, owns_db_file: Bool = true) -> ndb_t? {
+ static func open(path: String? = nil, owns_db_file: Bool = true, subscriptionCaller: NdbCallbackCaller) -> ndb_t? {
var ndb_p: OpaquePointer? = nil
let ingest_threads: Int32 = 4
@@ -111,6 +145,8 @@ class Ndb {
var ok = false
while !ok && mapsize > 1024 * 1024 * 700 {
var cfg = ndb_config(flags: 0, ingester_threads: ingest_threads, mapsize: mapsize, filter_context: nil, ingest_filter: nil, sub_cb_ctx: nil, sub_cb: nil)
+ let ctx: UnsafeMutableRawPointer = Unmanaged.passRetained(subscriptionCaller).toOpaque()
+ ndb_config_set_subscription_callback(&cfg, subscription_callback, ctx)
let res = ndb_init(&ndb_p, testdir, &cfg);
ok = res != 0;
if !ok {
class Ndb {
var ndb: ndb_t
let path: String?
let owns_db: Bool
var generation: Int
private var closed: Bool
+ var subscriptionCaller: NdbCallbackCaller
var is_closed: Bool {
self.closed || self.ndb.ndb == nil
}
-
+
+ func setSubscriptionCallback(subId: UInt64, _ callback: @escaping (NoteKey) -> Void) {
+ self.subscriptionCaller.subscriptionCallbackMap[subId] = callback
+ }
+
+ func unsetSubscriptionCallback(subId: UInt64) {
+ self.subscriptionCaller.subscriptionCallbackMap[subId] = nil
+ }
+
- And this allows me to easily integrate this with AsyncStream interfaces in Swift, like so:
return AsyncStream<StreamItem> { continuation in
(...) // Stream initial results
// Stream new results
let subid = ndb_subscribe(self.ndb.ndb, filtersPointer, Int32(filters.count))
// Set the subscription callback
self.setSubscriptionCallback(subId: subid, { noteKey in
continuation.yield(.event(noteKey))
})
// On termination, cancel the subscription and cleanup.
continuation.onTermination = { @Sendable _ in
ndb_unsubscribe(self.ndb.ndb, subid)
self.unsetSubscriptionCallback(subId: subid)
(...)
}
}
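For context, this is roughly how such a stream would then be consumed (a usage sketch; subscribe(filters:) and handleNewNote are placeholder names, not necessarily what the branch exposes):
// Hypothetical consumer of the stream above.
Task {
    for await item in ndb.subscribe(filters: filters) {
        if case .event(let noteKey) = item {
            await handleNewNote(noteKey)  // e.g. feed the note key into the timeline model
        }
    }
}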
I will start cleaning up the draft!
On Wed, Apr 09, 2025 at 03:02:16PM -0700, Daniel D’Aquino wrote:
danieldaquino left a comment (damus-io/damus#2885)
@jb55, switched over to the subscription callback model and now it works very well!
In brief, this is how I achieved that:
- I have an object that maps subscription IDs to callbacks, suitable for handling the global callback:
+class NdbCallbackCaller {
+ var ndb: ndb_t? = nil
+ var subscriptionCallbackMap: [UInt64: (NoteKey) -> Void] = [:]
+
+ func handleSubscriptionCallback(subId: UInt64) {
+ if let callback = subscriptionCallbackMap[subId] {
+ let result = UnsafeMutablePointer<UInt64>.allocate(capacity: 1)
+ if let ndb {
+ ndb_poll_for_notes(ndb.ndb, subId, result, 1)
The only thing that makes me a bit uncomfortable is that you're only polling a single note off the queue. There could be more than one note in there, so this could lead to missed notes. There is not a unique one-to-one mapping from callbacks to notes.
You should either pass a list of notes in the callback (most efficient), or call the callback multiple times, once for each note polled from poll_for_notes.
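A revised handler along these lines would drain the queue instead of taking a single note (a sketch against the draft above; the batch size of 32 is arbitrary, and ndb_poll_for_notes is assumed to return the number of note keys written):
// Sketch: poll in batches and invoke the callback once per note, so a
// single wake-up cannot drop queued notes.
func handleSubscriptionCallback(subId: UInt64) {
    guard let ndb, let callback = subscriptionCallbackMap[subId] else { return }
    let capacity = 32
    let buffer = UnsafeMutablePointer<UInt64>.allocate(capacity: capacity)
    defer { buffer.deallocate() }
    while true {
        let polled = ndb_poll_for_notes(ndb.ndb, subId, buffer, Int32(capacity))
        guard polled > 0 else { break }
        for i in 0..<Int(polled) {
            callback(buffer[i])
        }
        if Int(polled) < capacity { break }  // queue drained
    }
}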
Came back to this and did some refactoring to better separate the "dangerous" C/Swift interface logic (dealing with unsafe pointers) from the rest of the logic. I am still running into some random crashes. Debugging…
do you have a branch for this? I want to help debug
i tried to rebase this yesterday but was a bit hairy
do you have a branch for this? I want to help debug
I currently have the WIP at https://github.com/damus-io/damus/pull/2963/commits
i tried to rebase this yesterday but was a bit hairy
What were you rebasing? nostrdb-update against the latest master? Or my WIP branch against the latest nostrdb-update?
On Wed, Jun 04, 2025 at 12:09:01PM -0700, Daniel D’Aquino wrote:
What were you rebasing?
nostrdb-update against the latest master? Or my WIP branch against the latest nostrdb-update?
nostrdb-update on master
I may have a lead on a possible cause for the crashes. Experimenting…
Applied some tweaks that do seem to stabilize things a bit and reduce crashes, but some crashes remain. The investigation continues.
Found more detailed debugging info by testing it with the address sanitizer on.
Reproduction steps:
Damus: 94e5b67c533e988766eb3e186387c8ff3d2d1f71 (Latest in https://github.com/damus-io/damus/pull/2963 so far)
Setup: Address sanitizer enabled
Steps:
- Scroll down quickly until error is hit
Reproduced 3/3 times, and they all point to this portion of the code:
// send a copy to the writer
blocks_size = ndb_blocks_total_size(blocks);
blocks_to_writer = malloc(blocks_size);
memcpy(blocks_to_writer, blocks, blocks_size); // <--- Address sanitizer detects a heap buffer overflow here
assert(blocks->flags & NDB_BLOCK_FLAG_OWNED);
I am investigating if there are any incorrect memory calculations on that area of code that could be causing the issue.
@jb55, I believe I have found the root cause of the remaining crash.
Root cause
The address sanitizer detects a heap buffer overflow in this portion of nostrdb.c:
// send a copy to the writer
blocks_size = ndb_blocks_total_size(blocks);
blocks_to_writer = malloc(blocks_size);
memcpy(blocks_to_writer, blocks, blocks_size); // <--- Address sanitizer detects a heap buffer overflow here
assert(blocks->flags & NDB_BLOCK_FLAG_OWNED);
- The destination blocks_to_writer receives blocks_size bytes and has blocks_size bytes allocated, so there is no issue there.
- Therefore, one of two things could be incorrect:
1. The blocks_size calculation
The ndb_blocks_total_size() function — responsible for determining blocks_size — only gets the total_size member of the blocks:
// total size including padding
size_t ndb_blocks_total_size(struct ndb_blocks *blocks) {
assert(blocks->total_size < 1000000);
return blocks->total_size;
}
which is set during parsing of content:
int ndb_parse_content(unsigned char *buf, int buf_size,
const char *content, int content_len,
struct ndb_blocks **blocks_p)
{
(...)
//
// pad to 8-byte alignment
//
if (!cursor_align(&parser.buffer, 8))
return 0;
assert((parser.buffer.p - parser.buffer.start) % 8 == 0);
parser.blocks->total_size = parser.buffer.p - parser.buffer.start;
(...)
OR:
2. The allocated space for blocks
The other potential cause is that the initial allocation for blocks does not reserve enough memory to hold the data for parsing.
Memory for blocks is allocated in ndb_note_to_blocks:
static struct ndb_blocks *ndb_note_to_blocks(struct ndb_note *note)
{
const char *content;
size_t content_len;
struct ndb_blocks *blocks;
content = ndb_note_content(note);
content_len = ndb_note_content_length(note);
// something weird is going on
if (content_len >= INT32_MAX)
return NULL;
unsigned char *buffer = malloc(content_len); // <--- Allocated here
if (!buffer)
return NULL;
if (!ndb_parse_content(buffer, content_len, content, content_len, &blocks)) {
free(buffer);
return NULL;
}
//blocks = realloc(blocks, ndb_blocks_total_size(blocks)); // <--- Interesting
//if (blocks == NULL)
//return NULL;
blocks->flags |= NDB_BLOCK_FLAG_OWNED;
return blocks;
}
There is some code commented out for reallocating memory based on the calculated total size. I decided not to re-enable this because this commit message from 2712edb indicates it causes a separate crash:
nostrdb: fix realloc corruption
can't figure out why this is happening, but let's disable it for now
while we test. we shouldn't hit this code path anyways once we switch
over to local notes in damus ios
However, this points in the direction that content_len of the note may not be the right number of bytes for buffer: the parser serializes the ndb_blocks header and per-block data into that buffer, so it seems plausible that for content made up of many short blocks the serialized output is larger than the raw content length.
I decided to make an experiment to test this:
Experiment
Damus: 94e5b67c533e988766eb3e186387c8ff3d2d1f71 (Latest in https://github.com/damus-io/damus/pull/2963 so far) with the following local change:
diff --git a/nostrdb/src/nostrdb.c b/nostrdb/src/nostrdb.c
index f49d66d1..3e557126 100644
--- a/nostrdb/src/nostrdb.c
+++ b/nostrdb/src/nostrdb.c
@@ -7095,7 +7095,7 @@ static struct ndb_blocks *ndb_note_to_blocks(struct ndb_note *note)
if (content_len >= INT32_MAX)
return NULL;
- unsigned char *buffer = malloc(content_len);
+ unsigned char *buffer = malloc(2<<18);
if (!buffer)
return NULL;
Setup: Address sanitizer enabled
Steps:
- Scroll down quickly until error is hit
Result: Could not reproduce the issue after trying to quickly scroll through notes for more than 10 minutes
@jb55, I believe this remaining crash is due to insufficient memory being allocated to blocks. I am not sure if the best solution is to simply allocate more memory at the first allocation, or if we should try re-enabling the realloc call and investigating the other crash you saw when enabling that. What are your thoughts on this?
On Wed, Jun 04, 2025 at 08:46:45PM -0700, Daniel D’Aquino wrote:
@jb55, I believe this remaining crash is due to insufficient memory being allocated to
blocks. I am not sure if the best solution is to simply allocate more memory at the first allocation, or if we should try re-enabling the realloc call and investigating the other crash you saw when enabling that. What are your thoughts on this?
I remember running into this and debugging the same thing. I think this is why I started working toward the local relay model so I could avoid this fallback block parsing function...
I couldn't figure out where the corruption was coming from.
It's weird that 500kib wouldn't be enough for the note blocks... you could try to increase it to see if that fixes it.
It's weird that 500kib wouldn't be enough for the note blocks... you could try to increase it to see if that fixes it.
Sorry, I should have written my result with more clarity. Raising the allocation to 512 KiB did solve the problem: there were no crashes as soon as I made that change, so it does fix it! What did not seem to be enough is content_len, the value that was set before this experiment.
However, 512 KiB was just a high value I made up for testing; I haven't figured out the proper value yet, or whether reallocating later is the more correct fix.
On Thu, Jun 05, 2025 at 09:29:17AM -0700, Daniel D’Aquino wrote:
danieldaquino left a comment (damus-io/damus#2885)
It's weird that 500kib wouldn't be enough for the note blocks... you could try to increase it to see if that fixes it.
Sorry, I should have written my result with more clarity. Raising the allocation to 512 KiB did solve the problem: there were no crashes as soon as I made that change, so it does fix it! What did not seem to be enough is
content_len, the value that was set before this experiment.
oh ok perfect! I'm fine with a fixed value.
Refined PR and it is ready for review!
All details can be found in https://github.com/damus-io/damus/pull/2963.