
Constant errors from federation send worker

Open Nutomic opened this issue 3 weeks ago • 4 comments

Voyager constantly throws this error in a loop, which causes significant CPU usage:

lemmy-1  |
lemmy-1  | Caused by:
lemmy-1  |     err getting activity: LemmyError { message: NotFound, caller: /lemmy/crates/utils/src/error.rs:278:20, inner: Record not found
lemmy-1  |
lemmy-1  |     Stack backtrace:
lemmy-1  |        0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
lemmy-1  |        1: <core::result::Result<T,E> as lemmy_utils::error::LemmyErrorExt<T,E>>::with_lemmy_type::{{closure}}
lemmy-1  |        2: moka::future::value_initializer::ValueInitializer<K,V,S>::try_init_or_read::{{closure}}
lemmy-1  |        3: lemmy_apub_send::worker::InstanceWorker::spawn_send_if_needed::{{closure}}
lemmy-1  |        4: lemmy_apub_send::worker::InstanceWorker::loop_until_stopped::{{closure}}
lemmy-1  |        5: lemmy_apub_send::util::CancellableTask::spawn::{{closure}}
lemmy-1  |        6: tokio::runtime::task::raw::poll
lemmy-1  |        7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
lemmy-1  |        8: tokio::runtime::task::raw::poll
lemmy-1  |        9: std::sys::backtrace::__rust_begin_short_backtrace
lemmy-1  |       10: core::ops::function::FnOnce::call_once{{vtable.shim}}
lemmy-1  |       11: std::sys::thread::unix::Thread::new::thread_start
lemmy-1  |       12: <unknown>
lemmy-1  |       13: __clone }
lemmy-1  |
lemmy-1  | Stack backtrace:
lemmy-1  |    0: anyhow::error::<impl anyhow::Error>::msg
lemmy-1  |    1: anyhow::__private::format_err.11233
lemmy-1  |    2: lemmy_apub_send::util::get_activity_cached::{{closure}}::{{closure}}
lemmy-1  |    3: lemmy_apub_send::worker::InstanceWorker::spawn_send_if_needed::{{closure}}
lemmy-1  |    4: lemmy_apub_send::worker::InstanceWorker::loop_until_stopped::{{closure}}
lemmy-1  |    5: lemmy_apub_send::util::CancellableTask::spawn::{{closure}}
lemmy-1  |    6: tokio::runtime::task::raw::poll
lemmy-1  |    7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
lemmy-1  |    8: tokio::runtime::task::raw::poll
lemmy-1  |    9: std::sys::backtrace::__rust_begin_short_backtrace
lemmy-1  |   10: core::ops::function::FnOnce::call_once{{vtable.shim}}
lemmy-1  |   11: std::sys::thread::unix::Thread::new::thread_start
lemmy-1  |   12: <unknown>
lemmy-1  |   13: __clone })

As a workaround I set LEMMY_DISABLE_ACTIVITY_SENDING=true for now.

Nutomic avatar Dec 04 '25 09:12 Nutomic

cc @phiresky

dessalines avatar Dec 05 '25 14:12 dessalines

For the moment this has fixed itself. If the test server uses 100% CPU again, it might be this problem, and we will need to check the logs. With the better error message we can hopefully narrow it down.

Nutomic avatar Dec 05 '25 14:12 Nutomic

If I'm reading it right, this PR https://github.com/LemmyNet/lemmy/pull/5667/files#diff-899cee276e9713c57bb5ff473ec9bd72df7831111780c0d2de7a543544667b7e likely unintentionally replaced a return Ok(None) with an effective return Err(), which raises this error.

phiresky avatar Dec 05 '25 14:12 phiresky

See also the comment

/// May return None if the corresponding id does not exist or is a received activity.
/// Holes in serials are expected behaviour in postgresql

Returning None there silences this error, but the error itself must be caused by the last_send_id lagging behind the actual current sequence value of the id by some huge amount (like restarting from 0), which then causes the worker to iterate over all integers in between, hence the 100% CPU.
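
To make the two points above concrete, here is a minimal sketch, not the actual Lemmy code. ActivityTable, get_activity and send_pending are hypothetical names, and a BTreeMap stands in for the activity table. It assumes the lookup returns Ok(None) for a missing id (the behaviour described in the doc comment), so holes are skipped silently, but the loop still visits every integer between the last sent id and the newest id.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the activity table: id -> payload.
type ActivityTable = BTreeMap<i64, String>;

/// Hypothetical lookup. A missing id is not an error: holes in serials are
/// expected, so it returns Ok(None) instead of Err.
fn get_activity(table: &ActivityTable, id: i64) -> Result<Option<&String>, String> {
    Ok(table.get(&id))
}

/// Walks every id in (last_sent_id, latest_id] and returns the ids that were
/// "sent" plus the number of holes that were skipped.
fn send_pending(
    table: &ActivityTable,
    last_sent_id: i64,
    latest_id: i64,
) -> Result<(Vec<i64>, u64), String> {
    let mut sent = Vec::new();
    let mut skipped = 0u64;
    for id in (last_sent_id + 1)..=latest_id {
        match get_activity(table, id)? {
            // A real worker would deliver the payload here.
            Some(_payload) => sent.push(id),
            // Hole in the sequence: skip quietly.
            None => skipped += 1,
        }
    }
    Ok((sent, skipped))
}
```

With this shape, a last sent id that was reset to 0 while the sequence is in the millions means millions of lookups that all come back None. If the lookup returned Err for a hole instead, each of those iterations would also produce an error with a backtrace, which would match the log spam above.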

I'd wager this is likely related to the changes that were made to the scheduled task deleting activities, where it's easy to have an issue of the last_sent_id becoming a wrong value. I remember suggesting a solution of always keeping at least one row alive but afaik it deletes everything now and last sent id is reset somehow, don't remember exactly.

Alternatively ofc it could be due to some manual DB interventions that happened on the test instance.

phiresky avatar Dec 05 '25 14:12 phiresky

The error hasn't happened again. It could also be related to the use of allowlist federation, which is rarely used in production.

Nutomic avatar Dec 17 '25 11:12 Nutomic