EAGAIN / EWOULDBLOCK when writing logs to stderr, causing failure
Separating out the issue from #367. Backtrace:
Resource temporarily unavailable (os error 11)
Stack backtrace:
0: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
1: bupstash::client::SendSession::send_dir
2: bupstash::client::send
3: bupstash::put_main
4: bupstash::main
5: std::sys_common::backtrace::__rust_begin_short_backtrace
6: main
I'll try to get a better trace with line numbers, but this might already be sufficient for tracking it down.
This might very well be an error caused on my side due to how I deploy it (systemd service), but currently I don't understand enough about what's going wrong there to be sure.
Got another backtrace, not sure if it's a good one:
0: bupstash::put_main::{{closure}}
1: bupstash::put::BatchFileProcessor::process_batch
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys_common::backtrace::__rust_begin_short_backtrace
4: core::ops::function::FnOnce::call_once{{vtable.shim}}
5: std::sys::unix::thread::Thread::new::thread_start
6: start_thread
7: clone3
It looks like your OS is not letting you create new processes/threads - you have run out of them for some reason.
I think the solution is to find every place bupstash is spawning threads, add better error messages there, and also reevaluate the defaults for how many threads are created.
For your system, as a workaround, I think you need to find the right resource limit to increase.
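For the error-message half of that, a sketch of what a thread-spawn site could look like (hypothetical names, not actual bupstash code; assumes the anyhow crate):

use anyhow::Context;

// std::thread::spawn panics if the OS refuses to create a thread;
// Builder::spawn returns the io::Error instead, so a descriptive
// message can be attached to the failure.
fn spawn_worker() -> Result<std::thread::JoinHandle<()>, anyhow::Error> {
    std::thread::Builder::new()
        .name("bupstash-worker".to_string())
        .spawn(|| {
            // ... worker body ...
        })
        .context("failed to spawn worker thread (possibly an OS process/thread limit)")
}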
I don't think it can be threads, because that would cause a panic, whereas put fails with an Err.
Is the error in clone3? That looks like thread creation, see https://man.archlinux.org/man/clone3.2.en
You are reading the backtrace backwards: the most recent call is at the top. So clone is not where it ends, but where it starts, which makes sense. I added some .context to various error sites:
bupstash put: Failed to backup the data
Caused by:
0: Failed to process a batch
1: Failed to log file action
2: Resource temporarily unavailable (os error 11)
Stack backtrace:
0: bupstash::put_main::{{closure}}
1: bupstash::put::BatchFileProcessor::process_batch
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys_common::backtrace::__rust_begin_short_backtrace
4: core::ops::function::FnOnce::call_once{{vtable.shim}}
5: std::sys::unix::thread::Thread::new::thread_start
6: start_thread
7: clone3
So the error happens while trying to log the file action to stderr, for some reason. Now I can finally debug why that happens (probably some resource limit somewhere?) and how best to fix it.
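For reference, the added .context calls are roughly of this shape (a sketch with hypothetical names; the actual bupstash call sites differ):

use anyhow::Context;
use std::io::Write;

// Hypothetical error site: logging a file action to stderr.
fn log_file_action(path: &std::path::Path) -> Result<(), anyhow::Error> {
    writeln!(std::io::stderr(), "+ {}", path.display())
        .context("Failed to log file action")?;
    Ok(())
}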
As per https://github.com/travis-ci/travis-ci/issues/4704#issuecomment-348435959, it might be the case that stderr is somehow opened in non-blocking mode (n.b. "Resource temporarily unavailable" is the name for EAGAIN, as per errno -l). It would work fine most of the time because the buffer is usually large enough, except when logging a lot of information at once. I see that we create a progress bar even if it is hidden - maybe the progress bar libraries underneath set the file to non-blocking?
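A minimal sketch of that failure mode, assuming stderr is attached to a pipe or socket whose buffer can fill up (uses the libc crate; this is not bupstash code):

use std::io::Write;

fn main() {
    // Put stderr (fd 2) into non-blocking mode, like whatever is doing it here.
    unsafe {
        let flags = libc::fcntl(2, libc::F_GETFL);
        libc::fcntl(2, libc::F_SETFL, flags | libc::O_NONBLOCK);
    }
    // Spam stderr; once the pipe buffer is full, the write fails with
    // EAGAIN ("Resource temporarily unavailable") instead of blocking.
    let mut stderr = std::io::stderr();
    for i in 0u64.. {
        if let Err(e) = writeln!(stderr, "log line {}", i) {
            println!("stderr write failed after {} lines: {}", i, e);
            break;
        }
    }
}

Running it with stderr redirected to something that never reads it (e.g. 2> >(sleep 600) in bash) triggers the EAGAIN almost immediately.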
Great detective work - sorry for reading it backwards, I must be tired. That would make some sense: stderr might be backed up on a busy system when writing to the systemd journal.
fcntl(2, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
I can now confirm that stderr is set to non-blocking mode for some reason, that this is the failure cause, and that it is already set like that at the very first log_fn call.
I am not sure why, how, or where it is set to non-blocking though; my strace skills are honestly not strong enough for that. (I thought I found something interesting, then it turned out to be in the ssh sub-process and thus irrelevant)
Okay, I at least managed to narrow it down by spamming debug assertions. It fails on the second assert in protocol.rs, the first time that function is called:
fn read_from_remote(r: &mut dyn std::io::Read, buf: &mut [u8]) -> Result<(), anyhow::Error> {
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK */ }
    if r.read_exact(buf).is_err() {
        anyhow::bail!("remote disconnected")
    }
    unsafe { assert_ne!(libc::fcntl(2, libc::F_GETFL), 0x802); /* O_RDWR|O_NONBLOCK */ }
    Ok(())
}
Full stack trace:
thread 'main' panicked at 'assertion failed: `(left != right)`
left: `2050`,
right: `2050`', src/protocol.rs:305:14
stack backtrace:
0: 0x55eb4743ac3c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::haaf5fb1112692432
1: 0x55eb4737923e - core::fmt::write::hc9810906af1f037c
2: 0x55eb47451c74 - std::io::Write::write_fmt::h42570a50d1c05a1d
3: 0x55eb4743aa0f - std::sys_common::backtrace::print::hdb4dd2f5727cb1da
4: 0x55eb4746038f - std::panicking::default_hook::{{closure}}::hbf6435d4555f4c42
5: 0x55eb47460eec - std::panicking::rust_panic_with_hook::h2dbb792ca6451cd5
6: 0x55eb4743afe4 - std::panicking::begin_panic_handler::{{closure}}::h78627a3718fcdf2c
7: 0x55eb4743af4c - std::sys_common::backtrace::__rust_end_short_backtrace::h79f584276ca2060b
8: 0x55eb47460771 - rust_begin_unwind
9: 0x55eb471c9192 - core::panicking::panic_fmt::ha005c52a737c94d3
10: 0x55eb471c9323 - core::panicking::assert_failed_inner::h01d6713db4736cd7
11: 0x55eb471bea1a - core::panicking::assert_failed::hb4408f4384f13ece
12: 0x55eb472b51d5 - bupstash::protocol::read_from_remote::hb47b00bce27d95cd
13: 0x55eb472b542a - bupstash::protocol::read_packet::h04d4fc00400e1905
14: 0x55eb472f25f1 - bupstash::cli_to_opened_serve_process::hac17ab272c1b828b
15: 0x55eb472fbfaf - bupstash::put_main::h4767936ffc607059
16: 0x55eb47316a2b - bupstash::main::h228df2133ce59ea5
17: 0x55eb4720af83 - std::sys_common::backtrace::__rust_begin_short_backtrace::h363cb0560f2c5393
18: 0x55eb4732ce24 - main
19: 0x7fd7e4c2924e - __libc_start_call_main
20: 0x7fd7e4c29309 - __libc_start_main_alias_1
21: 0x55eb471d35b5 - _start
22: 0x0 - <unknown>
So somehow the Read that is passed in modifies the socket on first use? And moreover, as far as I can tell, that function should only be using the stdin/stdout of a piped sub-process, which has nothing to do with the actual stderr?!
I thought I found something interesting, then it turned out to be in the ssh sub-process and thus irrelevant
Well, it turns out that fd properties may be shared across processes, so this is not only relevant but is actually the root cause. #382 has a workaround fix for it.
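A small demonstration of that sharing, in case it surprises anyone else (uses the libc crate; not bupstash code): O_NONBLOCK lives on the open file description, which parent and child share, not on the fd number, so a child flipping its inherited fd 2 also flips the parent's stderr.

fn stderr_flags() -> i32 {
    unsafe { libc::fcntl(2, libc::F_GETFL) }
}

fn main() {
    println!("before child: stderr flags = {:#x}", stderr_flags());
    unsafe {
        if libc::fork() == 0 {
            // Child: set O_NONBLOCK on the inherited fd 2, like ssh does here.
            libc::fcntl(2, libc::F_SETFL, stderr_flags() | libc::O_NONBLOCK);
            libc::_exit(0);
        }
        // Parent: wait for the child, then observe that its own stderr changed too.
        libc::wait(std::ptr::null_mut());
    }
    println!("after child:  stderr flags = {:#x}", stderr_flags());
}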
I think I may also be running into this. Tried setting up a systemd service but it won't run for more than a second or so.