
EAGAIN / EWOULDBLOCK when writing logs to stderr, causing failure

Open piegamesde opened this issue 2 years ago • 10 comments

Separating out the issue from #367. Backtrace:

Resource temporarily unavailable (os error 11)
Stack backtrace:
0: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
1: bupstash::client::SendSession::send_dir
2: bupstash::client::send
3: bupstash::put_main
4: bupstash::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: main

I'll try to get a better trace with line numbers, but this might already be sufficient for tracking it down.

This might very well be an error on my side due to how I deploy it (as a systemd service), but I don't yet understand enough about what's going wrong there to be sure.

piegamesde avatar Feb 26 '23 21:02 piegamesde

Got another backtrace, though I'm not sure it's a useful one:

0: bupstash::put_main::{{closure}}
1: bupstash::put::BatchFileProcessor::process_batch
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys_common::backtrace::__rust_begin_short_backtrace
4: core::ops::function::FnOnce::call_once{{vtable.shim}}
5: std::sys::unix::thread::Thread::new::thread_start
6: start_thread
7: clone3

piegamesde avatar Apr 06 '23 09:04 piegamesde

It looks like your OS is not letting you create new processes/threads - you have run out of them for some reason.

I think the solution is to find every place bupstash spawns threads, add better error messages there, and also reevaluate the defaults for how many threads are created.

For your system, as a workaround, I think you need to find the right resource limit to increase.
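
If it really is thread exhaustion, a minimal sketch of the error-message suggestion above (this is not bupstash's actual code, and bupstash-worker is a made-up name) would be to spawn via std::thread::Builder, which returns a Result instead of panicking, and attach a message with anyhow:

use anyhow::Context;

// Hypothetical sketch: Builder::spawn reports a failed thread creation as an Err,
// which we can wrap in a descriptive message instead of an opaque OS error.
fn spawn_worker() -> Result<std::thread::JoinHandle<()>, anyhow::Error> {
    std::thread::Builder::new()
        .name("bupstash-worker".to_string())
        .spawn(|| {
            // worker body goes here
        })
        .context("failed to spawn worker thread (a process/thread limit may need raising)")
}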

andrewchambers avatar Apr 06 '23 09:04 andrewchambers

I think it can't be threads, because then it would be a panic; instead, put fails with an Err.

piegamesde avatar Apr 11 '23 20:04 piegamesde

Is the error in clone3? That looks like thread creation, see https://man.archlinux.org/man/clone3.2.en

andrewchambers avatar Apr 11 '23 20:04 andrewchambers

You are reading the backtrace backwards; the most recent call is at the top. So clone3 is not where it ends but where it starts, which makes sense. I added some .context calls to various error sites:

bupstash put: Failed to backup the data
Caused by:
    0: Failed to process a batch
    1: Failed to log file action
    2: Resource temporarily unavailable (os error 11)
Stack backtrace:
   0: bupstash::put_main::{{closure}}
   1: bupstash::put::BatchFileProcessor::process_batch
   2: core::ops::function::FnOnce::call_once{{vtable.shim}}
   3: std::sys_common::backtrace::__rust_begin_short_backtrace
   4: core::ops::function::FnOnce::call_once{{vtable.shim}}
   5: std::sys::unix::thread::Thread::new::thread_start
   6: start_thread
   7: clone3

So the error happens while trying to log the file action to stderr, for some reason. Now I can finally debug why that happens (probably some resource limit somewhere?) and how best to fix it.
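
For reference, the "Caused by" chain above is what anyhow produces when each error site is wrapped with .context; a hypothetical illustration (the real call sites in bupstash's put code differ):

use anyhow::Context;
use std::io::Write;

// Hypothetical call sites: each .context layer becomes one "Caused by" line
// when the innermost stderr write fails with EAGAIN.
fn log_file_action(line: &str) -> Result<(), anyhow::Error> {
    writeln!(std::io::stderr(), "{}", line).context("Failed to log file action")?;
    Ok(())
}

fn process_batch(paths: &[&str]) -> Result<(), anyhow::Error> {
    for &p in paths {
        log_file_action(p).context("Failed to process a batch")?;
    }
    Ok(())
}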


As per https://github.com/travis-ci/travis-ci/issues/4704#issuecomment-348435959, it might be the case that stderr is somehow opened in non-blocking mode (n.b. "Resource temporarily unavailable" is the name for EAGAIN, as per errno -l). It works fine most of the time because the buffer is usually large enough, and only fails when logging a lot of information at once. I see that we create a progress bar even when it is hidden; maybe the progress libraries underneath set the file to non-blocking?
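
To illustrate that failure mode, here is a standalone sketch (not bupstash code; it uses the libc crate) that puts the write end of a pipe into non-blocking mode and fills it until a write fails with os error 11:

use std::io::Write;
use std::os::unix::io::FromRawFd;

fn main() {
    let mut fds = [0i32; 2];
    unsafe {
        // Create a pipe and put its write end into non-blocking mode,
        // mimicking the stderr state suspected above.
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        let flags = libc::fcntl(fds[1], libc::F_GETFL);
        libc::fcntl(fds[1], libc::F_SETFL, flags | libc::O_NONBLOCK);
    }
    // Nobody reads the other end, so the pipe buffer (typically 64 KiB) fills up.
    let mut w = unsafe { std::fs::File::from_raw_fd(fds[1]) };
    loop {
        if let Err(e) = w.write_all(&[0u8; 4096]) {
            // Prints "Resource temporarily unavailable (os error 11)", kind WouldBlock.
            println!("write failed: {} (kind = {:?})", e, e.kind());
            break;
        }
    }
}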

piegamesde avatar Apr 12 '23 09:04 piegamesde

Great detective work - sorry for reading it backwards, I must be tired. That would make some sense: stderr might be backed up on a busy system when writing to the systemd journal.

andrewchambers avatar Apr 13 '23 06:04 andrewchambers

fcntl(2, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)

I can now confirm that stderr is set to non-blocking mode for some reason, that this is the cause of the failure, and that it is already set that way by the time of the very first log_fn call.

I am not sure why, how, or where it is set to non-blocking, though; my strace skills are honestly not strong enough for that. (I thought I had found something interesting, but it turned out to be in the ssh sub-process and thus irrelevant.)

piegamesde avatar Apr 14 '23 12:04 piegamesde

Okay, I at least managed to narrow it down by spamming debug assertions. It fails on the second assert in protocol.rs, the first time that function is called:

fn read_from_remote(r: &mut dyn std::io::Read, buf: &mut [u8]) -> Result<(), anyhow::Error> {
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK*/ }
    if r.read_exact(buf).is_err() {
        anyhow::bail!("remote disconnected")
    }
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK*/ }
    Ok(())
}

Full stack trace:

thread 'main' panicked at 'assertion failed: `(left != right)`
  left: `2050`,
 right: `2050`', src/protocol.rs:305:14
stack backtrace:
   0:     0x55eb4743ac3c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::haaf5fb1112692432
   1:     0x55eb4737923e - core::fmt::write::hc9810906af1f037c
   2:     0x55eb47451c74 - std::io::Write::write_fmt::h42570a50d1c05a1d
   3:     0x55eb4743aa0f - std::sys_common::backtrace::print::hdb4dd2f5727cb1da
   4:     0x55eb4746038f - std::panicking::default_hook::{{closure}}::hbf6435d4555f4c42
   5:     0x55eb47460eec - std::panicking::rust_panic_with_hook::h2dbb792ca6451cd5
   6:     0x55eb4743afe4 - std::panicking::begin_panic_handler::{{closure}}::h78627a3718fcdf2c
   7:     0x55eb4743af4c - std::sys_common::backtrace::__rust_end_short_backtrace::h79f584276ca2060b
   8:     0x55eb47460771 - rust_begin_unwind
   9:     0x55eb471c9192 - core::panicking::panic_fmt::ha005c52a737c94d3
  10:     0x55eb471c9323 - core::panicking::assert_failed_inner::h01d6713db4736cd7
  11:     0x55eb471bea1a - core::panicking::assert_failed::hb4408f4384f13ece
  12:     0x55eb472b51d5 - bupstash::protocol::read_from_remote::hb47b00bce27d95cd
  13:     0x55eb472b542a - bupstash::protocol::read_packet::h04d4fc00400e1905
  14:     0x55eb472f25f1 - bupstash::cli_to_opened_serve_process::hac17ab272c1b828b
  15:     0x55eb472fbfaf - bupstash::put_main::h4767936ffc607059
  16:     0x55eb47316a2b - bupstash::main::h228df2133ce59ea5
  17:     0x55eb4720af83 - std::sys_common::backtrace::__rust_begin_short_backtrace::h363cb0560f2c5393
  18:     0x55eb4732ce24 - main
  19:     0x7fd7e4c2924e - __libc_start_call_main
  20:     0x7fd7e4c29309 - __libc_start_main_alias_1
  21:     0x55eb471d35b5 - _start
  22:                0x0 - <unknown>

So somehow the Read that is passed in modifies the socket on its first usage? And moreover, as far as I can tell, that function should only be using the stdin/stdout of a piped sub-process, which has nothing to do with the actual stderr?!

piegamesde avatar Apr 14 '23 13:04 piegamesde

"I thought I had found something interesting, but it turned out to be in the ssh sub-process and thus irrelevant"

Well, it turns out that fd flags may be shared across processes (they live on the open file description, which child processes inherit), so what I saw in the ssh sub-process is not only relevant but is actually the root cause. #382 has a workaround fix for it.
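
A minimal sketch of that kind of workaround (an assumption on my part; the actual change in #382 may look different) is to clear O_NONBLOCK from the standard fds before writing. Note that the flag lives on the shared open file description, so clearing it also affects the ssh sub-process that shares the same stderr:

// Assumed workaround sketch, not necessarily what #382 does.
fn clear_nonblock(fd: libc::c_int) -> std::io::Result<()> {
    unsafe {
        let flags = libc::fcntl(fd, libc::F_GETFL);
        if flags < 0 {
            return Err(std::io::Error::last_os_error());
        }
        // Only call F_SETFL if the flag is actually set.
        if (flags & libc::O_NONBLOCK) != 0
            && libc::fcntl(fd, libc::F_SETFL, flags & !libc::O_NONBLOCK) < 0
        {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    clear_nonblock(libc::STDERR_FILENO)?;
    clear_nonblock(libc::STDOUT_FILENO)?;
    eprintln!("stderr is back in blocking mode");
    Ok(())
}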

piegamesde avatar Apr 17 '23 10:04 piegamesde

I think I may also be running into this. Tried setting up a systemd service but it won't run for more than a second or so.

jtrees avatar Dec 25 '23 22:12 jtrees