EAGAIN / EWOULDBLOCK when writing logs to stderr, causing failure
Separating out the issue from #367. Backtrace:
Resource temporarily unavailable (os error 11)
Stack backtrace:
0: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
1: bupstash::client::SendSession::send_dir
2: bupstash::client::send
3: bupstash::put_main
4: bupstash::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: main
I'll try to get a better trace with line numbers, but this might already be sufficient for tracking it down.
This might very well be an error caused on my side due to how I deploy it (systemd service), but currently I don't understand enough about what's going wrong there to be sure.
Got another backtrace, not sure if it's a good one:
0: bupstash::put_main::{{closure}}
1: bupstash::put::BatchFileProcessor::process_batch
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys_common::backtrace::__rust_begin_short_backtrace
4: core::ops::function::FnOnce::call_once{{vtable.shim}}
5: std::sys::unix::thread::Thread::new::thread_start
6: start_thread
7: clone3
It looks like your OS is not letting you create new processes/threads - you have run out of them for some reason.
I think the solution is to find every place bupstash is spawning threads, add better error messages there, and also reevaluate the defaults for how many threads are created.
For your system, as a workaround, I think you need to find the right resource limit to increase.
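For the error-message half of that, a sketch of what a thread-spawn site could look like (hypothetical names, not actual bupstash code; assumes the anyhow crate):

use anyhow::Context;

// std::thread::spawn panics if the OS refuses to create a thread;
// Builder::spawn returns the io::Error instead, so a descriptive
// message can be attached to the failure.
fn spawn_worker() -> Result<std::thread::JoinHandle<()>, anyhow::Error> {
    std::thread::Builder::new()
        .name("bupstash-worker".to_string())
        .spawn(|| {
            // ... worker body ...
        })
        .context("failed to spawn worker thread (possibly an OS process/thread limit)")
}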
I don't think it can be threads, because that would cause a panic, whereas put fails with an Err.
Is the error in clone3? That looks like thread creation, see https://man.archlinux.org/man/clone3.2.en
You are reading the backtrace backwards: the most recent call is at the top. So clone is not where it ends, but where it starts, which makes sense. I added some .context to various error sites:
bupstash put: Failed to backup the data
Caused by:
0: Failed to process a batch
1: Failed to log file action
2: Resource temporarily unavailable (os error 11)
Stack backtrace:
0: bupstash::put_main::{{closure}}
1: bupstash::put::BatchFileProcessor::process_batch
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys_common::backtrace::__rust_begin_short_backtrace
4: core::ops::function::FnOnce::call_once{{vtable.shim}}
5: std::sys::unix::thread::Thread::new::thread_start
6: start_thread
7: clone3
So the error happens while trying to log the file action to stderr, for some reason. Now I can finally debug why that happens (probably some resource limit somewhere?) and how best to fix it.
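For reference, the added .context calls are roughly of this shape (a sketch with hypothetical names; the actual bupstash call sites differ):

use anyhow::Context;
use std::io::Write;

// Hypothetical error site: logging a file action to stderr.
fn log_file_action(path: &std::path::Path) -> Result<(), anyhow::Error> {
    writeln!(std::io::stderr(), "+ {}", path.display())
        .context("Failed to log file action")?;
    Ok(())
}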
As per https://github.com/travis-ci/travis-ci/issues/4704#issuecomment-348435959, it might be the case that stderr is somehow opened in non-blocking mode (n.b. "Resource temporarily unavailable" is the name for EAGAIN, as per errno -l). It would work fine most of the time because the buffer is usually large enough, except when logging a lot of information at once. I see that we create a progress bar even if it is hidden - maybe the progress bar libraries underneath set the file to non-blocking?
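A minimal sketch of that failure mode, assuming stderr is attached to a pipe or socket whose buffer can fill up (uses the libc crate; this is not bupstash code):

use std::io::Write;

fn main() {
    // Put stderr (fd 2) into non-blocking mode, like whatever is doing it here.
    unsafe {
        let flags = libc::fcntl(2, libc::F_GETFL);
        libc::fcntl(2, libc::F_SETFL, flags | libc::O_NONBLOCK);
    }
    // Spam stderr; once the pipe buffer is full, the write fails with
    // EAGAIN ("Resource temporarily unavailable") instead of blocking.
    let mut stderr = std::io::stderr();
    for i in 0u64.. {
        if let Err(e) = writeln!(stderr, "log line {}", i) {
            println!("stderr write failed after {} lines: {}", i, e);
            break;
        }
    }
}

Running it with stderr redirected to something that never reads it (e.g. 2> >(sleep 600) in bash) triggers the EAGAIN almost immediately.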
Great detective work - sorry for reading it backwards, I must be tired. That would make some sense: stderr might be backed up on a busy system when writing to the systemd journal.
fcntl(2, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
I can now confirm that stderr is set to non-blocking mode for some reason, that this is the failure cause, and that it is already set like that at the very first log_fn call.
I am not sure why, how, or where it is set to non-blocking though; my strace skills are honestly not strong enough for that. (I thought I found something interesting, then it turned out to be in the ssh sub-process and thus irrelevant)
Okay, I at least managed to narrow it down by spamming debug assertions. It fails on the second assert in protocol.rs, the first time that function is called:
fn read_from_remote(r: &mut dyn std::io::Read, buf: &mut [u8]) -> Result<(), anyhow::Error> {
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK */ }
    if r.read_exact(buf).is_err() {
        anyhow::bail!("remote disconnected")
    }
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK */ }
    Ok(())
}
Full stack trace:
thread 'main' panicked at 'assertion failed: `(left != right)`
left: `2050`,
right: `2050`', src/protocol.rs:305:14
stack backtrace:
0: 0x55eb4743ac3c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::haaf5fb1112692432
1: 0x55eb4737923e - core::fmt::write::hc9810906af1f037c
2: 0x55eb47451c74 - std::io::Write::write_fmt::h42570a50d1c05a1d
3: 0x55eb4743aa0f - std::sys_common::backtrace::print::hdb4dd2f5727cb1da
4: 0x55eb4746038f - std::panicking::default_hook::{{closure}}::hbf6435d4555f4c42
5: 0x55eb47460eec - std::panicking::rust_panic_with_hook::h2dbb792ca6451cd5
6: 0x55eb4743afe4 - std::panicking::begin_panic_handler::{{closure}}::h78627a3718fcdf2c
7: 0x55eb4743af4c - std::sys_common::backtrace::__rust_end_short_backtrace::h79f584276ca2060b
8: 0x55eb47460771 - rust_begin_unwind
9: 0x55eb471c9192 - core::panicking::panic_fmt::ha005c52a737c94d3
10: 0x55eb471c9323 - core::panicking::assert_failed_inner::h01d6713db4736cd7
11: 0x55eb471bea1a - core::panicking::assert_failed::hb4408f4384f13ece
12: 0x55eb472b51d5 - bupstash::protocol::read_from_remote::hb47b00bce27d95cd
13: 0x55eb472b542a - bupstash::protocol::read_packet::h04d4fc00400e1905
14: 0x55eb472f25f1 - bupstash::cli_to_opened_serve_process::hac17ab272c1b828b
15: 0x55eb472fbfaf - bupstash::put_main::h4767936ffc607059
16: 0x55eb47316a2b - bupstash::main::h228df2133ce59ea5
17: 0x55eb4720af83 - std::sys_common::backtrace::__rust_begin_short_backtrace::h363cb0560f2c5393
18: 0x55eb4732ce24 - main
19: 0x7fd7e4c2924e - __libc_start_call_main
20: 0x7fd7e4c29309 - __libc_start_main_alias_1
21: 0x55eb471d35b5 - _start
22: 0x0 - <unknown>
So somehow the Read that is passed in modifies the socket on first use? And moreover, as far as I can tell, that function should only be using the stdin/stdout of a piped sub-process, which has nothing to do with the actual stderr?!
I thought I found something interesting, then it turned out to be in the ssh sub-process and thus irrelevant
Well, it turns out that fd properties may be shared across processes, so this is not only relevant but is actually the root cause. #382 has a workaround fix for it.
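A small demonstration of that sharing, in case it surprises anyone else (uses the libc crate; not bupstash code): O_NONBLOCK lives on the open file description, which parent and child share, not on the fd number, so a child flipping its inherited fd 2 also flips the parent's stderr.

fn stderr_flags() -> i32 {
    unsafe { libc::fcntl(2, libc::F_GETFL) }
}

fn main() {
    println!("before child: stderr flags = {:#x}", stderr_flags());
    unsafe {
        if libc::fork() == 0 {
            // Child: set O_NONBLOCK on the inherited fd 2, like ssh does here.
            libc::fcntl(2, libc::F_SETFL, stderr_flags() | libc::O_NONBLOCK);
            libc::_exit(0);
        }
        // Parent: wait for the child, then observe that its own stderr changed too.
        libc::wait(std::ptr::null_mut());
    }
    println!("after child:  stderr flags = {:#x}", stderr_flags());
}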
I think I may also be running into this. Tried setting up a systemd service but it won't run for more than a second or so.