
nostrdb update: note block rendering issue

Open jb55 opened this issue 9 months ago • 10 comments

It seems that on the new nostrdb update branch, note blocks are sometimes not immediately available. This makes sense, since we have not switched to the local relay model yet.

Let's just make sure we have lazy block parsing working like we did before.

jb55 avatar Feb 26 '25 07:02 jb55

@jb55, after analyzing and testing the nostrdb branch, I am not sure I fully understand what the issue is here.

  1. I can't reproduce the problem on iOS; notes seem to be rendering as expected in my testing.
  2. From studying the code changes, it seems like we already have lazy block parsing (More on that below).

1. Repro attempt

Damus: 09cfb332b3d715d7663a8bff826c488ba85f875f
Device: iPhone SE simulator
Steps:

  1. Scroll down on the timeline for about 3 minutes, checking whether note artifacts load properly.

Result: Notes seem to be loading as expected.


2. Code analysis

Damus: 09cfb332b3d715d7663a8bff826c488ba85f875f

  1. NoteContentView will call load when it appears, and when it receives a profile_updated event.
  2. load has some loading and planning logic depending on the state, but it seems that it ultimately calls render_note_content to load the note.
  3. render_note_content calls ev.blocks(ndb: ndb), which calls lookup_blocks_by_key_with_txn, which in turn calls ndb_get_blocks_by_key.
  4. ndb_get_blocks_by_key has this portion of code, which indicates that the blocks are lazily generated if they are not already in the database:
	if ((blocks = ndb_lookup_by_key(txn, note_key, NDB_DB_NOTE_BLOCKS, &note_len))) {
		return blocks;
	}

	// If we don't have note blocks, let's lazily generate them. This is
	// migration-friendly instead of doing them all at once
	if (!(note = ndb_get_note_by_key(txn, note_key, &note_len))) {
		// no note found, can't return note blocks
		return NULL;
	}

	if (!(blocks = ndb_note_to_blocks(note)))
		return NULL;

So it seems that the note blocks should be lazily generated even if they are not immediately available in the database. Where I can see this failing is if the note itself is not available in the database, or if some error during parsing propagates and causes ev.blocks(ndb: ndb) to return nil.

Can you please help me understand what the issue is?

danieldaquino avatar Apr 03 '25 20:04 danieldaquino

Made a very WIP draft (that does not fully work) for one of the possible solutions: https://github.com/damus-io/damus/pull/2963.

danieldaquino avatar Apr 05 '25 02:04 danieldaquino

Solved some issues related to memory leaks, high CPU usage, and notes getting stuck while rendering.

Some notes are still getting stuck while rendering after scrolling down for a minute. It seems like I am running into some form of deadlock in NostrDB associated with an ndb_wait_for_notes call, after which all notes suddenly stop loading.

@jb55, are you aware of any potential pitfalls or things that could cause ndb_wait_for_notes to deadlock Ndb altogether?

danieldaquino avatar Apr 08 '25 01:04 danieldaquino

I was going to remove wait_for_notes. what are you using it for? you should use the callback mechanism instead to get woken up when notes are received off of subscriptions.

jb55 avatar Apr 08 '25 23:04 jb55

one potential deadlock issue I see is that our ndb_notify_subscriptions calls the subscription callback while holding the monitor lock. This means that if any callback code tries to poll for notes, wait for notes, subscribe, or unsubscribe, it would deadlock...

do we have a subscription callback registered?

I think the fix for this would be to call monitor->sub_cb outside of the lock at the bottom of ndb_notify_subscriptions
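
For illustration, here is a minimal sketch of that pattern in C (collect the pending notifications while the monitor lock is held, then invoke the callback after releasing it); the struct and field names below are simplified stand-ins, not nostrdb's actual internals:

#include <pthread.h>
#include <stdint.h>

#define MAX_SUBS 32

/* sketch only: simplified stand-ins for the monitor/subscription state */
typedef void (*sub_cb_fn)(void *ctx, uint64_t subid);

struct monitor {
	pthread_mutex_t mutex;
	sub_cb_fn sub_cb;
	void *sub_cb_ctx;
	int num_subs;
	struct { uint64_t subid; int has_new_notes; } subs[MAX_SUBS];
};

static void notify_subscriptions(struct monitor *m)
{
	uint64_t pending[MAX_SUBS];
	int i, n = 0;

	/* snapshot which subscriptions need notifying while holding the lock */
	pthread_mutex_lock(&m->mutex);
	for (i = 0; i < m->num_subs; i++) {
		if (m->subs[i].has_new_notes) {
			m->subs[i].has_new_notes = 0;
			pending[n++] = m->subs[i].subid;
		}
	}
	pthread_mutex_unlock(&m->mutex);

	/* invoke the callback outside the lock, so callback code can poll,
	 * wait, subscribe, or unsubscribe without deadlocking */
	if (m->sub_cb) {
		for (i = 0; i < n; i++)
			m->sub_cb(m->sub_cb_ctx, pending[i]);
	}
}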

jb55 avatar Apr 08 '25 23:04 jb55

hmm it does not look like we are registering a subscription callback... but yeah we should be using that instead. wait_for_notes is very inefficient: it blocks the thread, creates massive contention on the monitor lock, and can block other threads. in fact, i would go as far as to say wait_for_notes should never be used in a multithreaded context, which is why I was going to remove it.

jb55 avatar Apr 09 '25 00:04 jb55

sub callback is here:

var cfg = ndb_config(flags: 0, ingester_threads: ingest_threads, mapsize: mapsize, filter_context: nil, ingest_filter: nil, sub_cb_ctx: nil, sub_cb: nil)

jb55 avatar Apr 09 '25 00:04 jb55

@jb55, switched over to the subscription callback model and now it works very well!

In brief, this is how I achieved that:

  1. I have an object that maps subscription IDs to callbacks, suitable for handling the global callback:
+class NdbCallbackCaller {
+    var ndb: ndb_t? = nil
+    var subscriptionCallbackMap: [UInt64: (NoteKey) -> Void] = [:]
+    
+    func handleSubscriptionCallback(subId: UInt64) {
+        if let callback = subscriptionCallbackMap[subId] {
+            let result = UnsafeMutablePointer<UInt64>.allocate(capacity: 1)
+            if let ndb {
+                ndb_poll_for_notes(ndb.ndb, subId, result, 1)
+                let noteKey: NoteKey = result.pointee
+                callback(noteKey)
+            }
+            result.deallocate()
+        }
+    }
+}
+
  2. I created the "trampoline" function in Swift that can be called back from C:
+// This C-exported callback function will be called from the C library when new notes arrive.
+@_cdecl("subscription_callback")
+public func subscription_callback(ctx: UnsafeMutableRawPointer?, subid: UInt64) {
+    guard let ctx = ctx else { return }
+    let holder = Unmanaged<NdbCallbackCaller>.fromOpaque(ctx).takeUnretainedValue()
+    holder.handleSubscriptionCallback(subId: subid)
+}
  3. I added this subscription mapping/handler object to our Ndb instance and pointed the callback to it in the C library:
-    static func open(path: String? = nil, owns_db_file: Bool = true) -> ndb_t? {
+    static func open(path: String? = nil, owns_db_file: Bool = true, subscriptionCaller: NdbCallbackCaller) -> ndb_t? {
         var ndb_p: OpaquePointer? = nil
 
         let ingest_threads: Int32 = 4
@@ -111,6 +145,8 @@ class Ndb {
             var ok = false
             while !ok && mapsize > 1024 * 1024 * 700 {
                 var cfg = ndb_config(flags: 0, ingester_threads: ingest_threads, mapsize: mapsize, filter_context: nil, ingest_filter: nil, sub_cb_ctx: nil, sub_cb: nil)
+                let ctx: UnsafeMutableRawPointer = Unmanaged.passRetained(subscriptionCaller).toOpaque()
+                ndb_config_set_subscription_callback(&cfg, subscription_callback, ctx)
                 let res = ndb_init(&ndb_p, testdir, &cfg);
                 ok = res != 0;
                 if !ok {
 class Ndb {
     var ndb: ndb_t
     let path: String?
     let owns_db: Bool
     var generation: Int
     private var closed: Bool
+    var subscriptionCaller: NdbCallbackCaller
 
     var is_closed: Bool {
         self.closed || self.ndb.ndb == nil
     }
-
+    
+    func setSubscriptionCallback(subId: UInt64, _ callback: @escaping (NoteKey) -> Void) {
+        self.subscriptionCaller.subscriptionCallbackMap[subId] = callback
+    }
+    
+    func unsetSubscriptionCallback(subId: UInt64) {
+        self.subscriptionCaller.subscriptionCallbackMap[subId] = nil
+    }
+    
  4. And this makes it easy to integrate with AsyncStream interfaces in Swift, like so:
return AsyncStream<StreamItem> { continuation in
            (...) // Stream initial results
            
            // Stream new results
            let subid = ndb_subscribe(self.ndb.ndb, filtersPointer, Int32(filters.count))
            
            // Set the subscription callback
            self.setSubscriptionCallback(subId: subid, { noteKey in
                continuation.yield(.event(noteKey))
            })
            
            // On termination, cancel the subscription and cleanup.
            continuation.onTermination = { @Sendable _ in
                ndb_unsubscribe(self.ndb.ndb, subid)
                self.unsetSubscriptionCallback(subId: subid)
                (...)
            }
        }

I will start cleaning up the draft!

danieldaquino avatar Apr 09 '25 22:04 danieldaquino

The only thing that makes me a bit uncomfortable is that you're only polling single notes off the queue. There could be more than one note in there, so this could lead to missed notes. There is not a unique one-to-one mapping from callbacks to notes.

You should either pass a list of notes in the callback (most efficient), or call the callback once for each note polled from poll_for_notes.
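
For illustration, a minimal sketch of the second option in C, assuming ndb_poll_for_notes returns the number of note keys it wrote (as its usage above suggests); handle_note_key is a hypothetical per-note handler, and the same loop shape would apply inside the Swift handleSubscriptionCallback:

#include <stdint.h>
#include "nostrdb.h"

/* sketch only: drain every queued note key for this subscription instead of
 * polling a single one, invoking the per-note handler for each key */
static void drain_subscription(struct ndb *ndb, uint64_t subid,
			       void (*handle_note_key)(uint64_t note_key))
{
	uint64_t keys[64];
	int i, n;

	/* keep polling until the subscription queue is empty */
	while ((n = ndb_poll_for_notes(ndb, subid, keys, 64)) > 0) {
		for (i = 0; i < n; i++)
			handle_note_key(keys[i]);
	}
}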

jb55 avatar Apr 09 '25 22:04 jb55

Came back to this, and did some refactoring to better separate the "dangerous" C/Swift interface logic (Dealing with unsafe pointers) from the rest of the logic. I am still running into some random crashes. Debugging…

danieldaquino avatar Jun 03 '25 02:06 danieldaquino

do you have a branch for this? I want to help debug

jb55 avatar Jun 03 '25 13:06 jb55

i tried to rebase this yesterday but it was a bit hairy

jb55 avatar Jun 03 '25 14:06 jb55

do you have a branch for this? I want to help debug

I currently have the WIP at https://github.com/damus-io/damus/pull/2963/commits

i tried to rebase this yesterday but was a bit hairy

What were you rebasing? nostrdb-update against the latest master? Or my WIP branch against the latest nostrdb-update?

danieldaquino avatar Jun 04 '25 19:06 danieldaquino

nostrdb-update on master

jb55 avatar Jun 04 '25 19:06 jb55

I may have a lead on a possible cause for the crashes. Experimenting…

danieldaquino avatar Jun 04 '25 21:06 danieldaquino

Applied some tweaks that do seem to stabilize things a bit and reduce crashes, but some crashes remain. The investigation continues.

danieldaquino avatar Jun 05 '25 00:06 danieldaquino

Found more detailed debugging info by testing it with the address sanitizer on.

Reproduction steps:
Damus: 94e5b67c533e988766eb3e186387c8ff3d2d1f71 (Latest in https://github.com/damus-io/damus/pull/2963 so far)
Setup: Address sanitizer enabled
Steps:

  1. Scroll down quickly until error is hit

Reproduced 3/3 times, and they all point to this portion of the code:

// send a copy to the writer
blocks_size = ndb_blocks_total_size(blocks);
blocks_to_writer = malloc(blocks_size);
memcpy(blocks_to_writer, blocks, blocks_size); <--- Address sanitizer detects a heap buffer overflow here
assert(blocks->flags & NDB_BLOCK_FLAG_OWNED);

I am investigating whether there are any incorrect memory calculations in that area of code that could be causing the issue.

danieldaquino avatar Jun 05 '25 01:06 danieldaquino

@jb55, I believe I have found the root cause of the remaining crash.

Root cause

The address sanitizer detects a heap buffer overflow in this portion of nostrdb.c:

// send a copy to the writer
blocks_size = ndb_blocks_total_size(blocks);
blocks_to_writer = malloc(blocks_size);
memcpy(blocks_to_writer, blocks, blocks_size); <--- Address sanitizer detects a heap buffer overflow here
assert(blocks->flags & NDB_BLOCK_FLAG_OWNED);
  1. The destination blocks_to_writer has blocks_size bytes allocated and receives blocks_size bytes, so there is no issue on the destination side; the overflow must be a read past the end of the source blocks, meaning blocks_size is larger than the memory actually allocated for blocks.
  2. Therefore, one of two things could be incorrect:

1. The blocks_size calculation

The ndb_blocks_total_size() function, which is responsible for determining blocks_size, simply returns the total_size member of the blocks:

// total size including padding
size_t ndb_blocks_total_size(struct ndb_blocks *blocks) {
	assert(blocks->total_size < 1000000);
	return blocks->total_size;
}

which is set during parsing of content:

int ndb_parse_content(unsigned char *buf, int buf_size,
		      const char *content, int content_len,
		      struct ndb_blocks **blocks_p)
{
	(...)

	//
	// pad to 8-byte alignment
	//
	if (!cursor_align(&parser.buffer, 8))
		return 0;
	assert((parser.buffer.p - parser.buffer.start) % 8 == 0);
	parser.blocks->total_size = parser.buffer.p - parser.buffer.start;
        (...)

OR:

2. The allocated space for blocks

The other potential cause is that the initial allocation for blocks does not reserve enough memory to hold the data for parsing.

Memory for blocks is allocated in ndb_note_to_blocks:

static struct ndb_blocks *ndb_note_to_blocks(struct ndb_note *note)
{
	const char *content;
	size_t content_len;
	struct ndb_blocks *blocks;

	content = ndb_note_content(note);
	content_len = ndb_note_content_length(note);

	// something weird is going on
	if (content_len >= INT32_MAX)
		return NULL;

	unsigned char *buffer = malloc(content_len); // <--- Allocated here
	if (!buffer)
		return NULL;

	if (!ndb_parse_content(buffer, content_len, content, content_len, &blocks)) {
		free(buffer);
		return NULL;
	}

	//blocks = realloc(blocks, ndb_blocks_total_size(blocks));  // <--- Interesting
	//if (blocks == NULL)
		//return NULL;

	blocks->flags |= NDB_BLOCK_FLAG_OWNED;

	return blocks;
}

There is some code commented out for reallocating memory based on the calculated total size. I decided not to re-enable this because this commit message from 2712edb indicates it causes a separate crash:

nostrdb: fix realloc corruption

can't figure out why this is happening, but let's disable it for now
while we test. we shouldn't hit this code path anyways once we switch
over to local notes in damus ios

However, this points in the direction that content_len of the note may not be enough bytes for buffer; presumably the parsed blocks structure (its header, per-block records, and alignment padding) can end up larger than the raw content length.

I decided to run an experiment to test this:

Experiment

Damus: 94e5b67c533e988766eb3e186387c8ff3d2d1f71 (Latest in https://github.com/damus-io/damus/pull/2963 so far) with the following local change:

diff --git a/nostrdb/src/nostrdb.c b/nostrdb/src/nostrdb.c
index f49d66d1..3e557126 100644
--- a/nostrdb/src/nostrdb.c
+++ b/nostrdb/src/nostrdb.c
@@ -7095,7 +7095,7 @@ static struct ndb_blocks *ndb_note_to_blocks(struct ndb_note *note)
    if (content_len >= INT32_MAX)
        return NULL;

-   unsigned char *buffer = malloc(content_len);
+    unsigned char *buffer = malloc(2<<18);
    if (!buffer)
        return NULL;

Setup: Address sanitizer enabled
Steps:

  1. Scroll down quickly until error is hit

Result: Could not reproduce the issue after trying to quickly scroll through notes for more than 10 minutes


@jb55, I believe this remaining crash is due to insufficient memory being allocated for blocks. I am not sure whether the best solution is to simply allocate more memory in the initial allocation, or to try re-enabling the realloc call and investigate the other crash you saw when it was enabled. What are your thoughts on this?
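
For reference, if the realloc route is revisited, a minimal sketch of the usual safe pattern would assign realloc's result to a temporary so a failure does not lose or leak the original blocks. This only fixes the error handling of the commented-out code and does not address the corruption mentioned in 2712edb; it also assumes struct ndb_blocks and ndb_blocks_total_size are visible through nostrdb's headers:

#include <stdlib.h>
#include "nostrdb.h"

/* sketch only: shrink (or grow) the blocks buffer to its computed total size
 * without losing the original allocation if realloc fails */
static struct ndb_blocks *resize_blocks(struct ndb_blocks *blocks)
{
	size_t total = ndb_blocks_total_size(blocks);
	struct ndb_blocks *resized = realloc(blocks, total);

	/* on failure, keep using the original buffer instead of leaking it */
	return resized ? resized : blocks;
}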

danieldaquino avatar Jun 05 '25 03:06 danieldaquino

I remember running into this and debugging the same thing. I think this is why I started working toward the local relay model so I could avoid this fallback block parsing function...

I couldn't figure out where the corruption was coming from.

It's weird that 500kib wouldn't be enough for the note blocks... you could try to increase it to see if that fixes it.

jb55 avatar Jun 05 '25 15:06 jb55

It's weird that 500kib wouldn't be enough for the note blocks... you could try to increase it to see if that fixes it.

Sorry, I should have written my result more clearly. Raising the allocation to 512 KiB (2<<18 bytes) did solve the problem: there were no crashes as soon as I made that change, so it does fix it! What did not seem to be enough was content_len, the value used before this experiment.

However, 512 KiB was just a high value I made up for testing; I haven't figured out the proper value, or whether reallocating later is the more correct fix.

danieldaquino avatar Jun 05 '25 16:06 danieldaquino

oh ok perfect! I'm fine with a fixed value.

jb55 avatar Jun 05 '25 16:06 jb55

Refined the PR and it is ready for review!

All details can be found in https://github.com/damus-io/damus/pull/2963.

danieldaquino avatar Jun 06 '25 22:06 danieldaquino