rustler icon indicating copy to clipboard operation
rustler copied to clipboard

NIF panics when producing too much input in too short of a time

Open hansihe opened this issue 8 years ago • 14 comments

Not sure if there is any way to actually fix this. This is probably because Erlang has stdout set to non-blocking mode, and a buffer gets filled up somewhere.

hansihe avatar Jan 19 '17 09:01 hansihe

Can you paste some example code that produces this scenario?

scrogson avatar Jan 19 '17 13:01 scrogson

@scrogson

pub fn print_panic<'a>(env: NifEnv<'a>, _args: &Vec<NifTerm<'a>>) -> NifResult<NifTerm<'a>> {
    for _ in 0..1000 {
        println!("testtesttesttesttesttesttesttesttesttesttesttest");
    }

    Ok(true.encode(env))
}

hansihe avatar Jan 19 '17 13:01 hansihe

hmm...but this isn't real world use case, is it? Seems only useful for debugging purposes.

scrogson avatar Jan 19 '17 13:01 scrogson

Which is a very real world use case. I have hit this issue numerous times when writing different kinds of NIFs.

hansihe avatar Jan 19 '17 13:01 hansihe

Yeah, this has hit me more than once, just trying to debug misbehaving tests.

Rustler can provide its own print! and println! macros that work around this.

jorendorff avatar Jan 26 '17 23:01 jorendorff

erlnif have some api can provide async thread spawn which can aviod such situations that going to block the beam

zhaohansprt avatar Mar 31 '17 02:03 zhaohansprt

Rustler has APIs like that too, but they don't do anything to solve this problem.

I still think just replacing print! and println! is the way to go here.

jorendorff avatar Mar 31 '17 02:03 jorendorff

cool!

zhaohansprt avatar Mar 31 '17 02:03 zhaohansprt

Using println! repeatedly is a costly operation, as a lock has to be acquired on each invocation. By acquiring the lock only once, this can possibly be worked around. The following works on my machine without a panic:

use rustler::{Encoder, NifResult};
use std::io::Write;
fn crash<'a>(env: Env<'a>, _: &[Term]) -> NifResult<Term<'a>> {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for _ in 0..10000 {
        writeln!(&mut lock, "testtesttesttesttesttesttesttesttesttesttesttest");
    }

    Ok(true.encode(env))
}

Of course, this does not solve the problem, as simply increasing the iterations will still lead to panics. 100_000 iterations of the loop above result in a panic here.

evnu avatar Jan 11 '19 10:01 evnu

FWIW, I just hit this problem with only a few KB, (one terminal screen full or so, no loops, just individual logging statements), and it cost me a day chasing down where my panic was coming from, as I assumed I had a logic error somewhere. As we may not have good source-level debugging tools for Rust code running inside the BEAM, early development can be highly-dependent on debug messages.

That said, there is a workaround to at least avoid the crash:

use gag::Redirect;
…
  let log = OpenOptions::new()
    .truncate(true)
    .read(true)
    .create(true)
    .write(true)
    .open("/var/log/…")
    .unwrap();

  let print_redirect = Redirect::stdout(log).unwrap();
  let pd_heap = Box::new(print_redirect);
  let _ = Box::into_raw(pd_heap);

This has the awkward effect of segregating output of IO.put/inspect from the NIF output, but it's better than mysterious panics.

shribe avatar Jan 20 '20 17:01 shribe

I can trigger it reliably with https://github.com/evnu/rustler/commit/6d1ce19271193806f2a2270482c0e5eadbfee17c if I run it through rust-gdb and an ERTS Debug VM. When running the example directly with rust 1.40 and alacritty, the panic happens very rarely here.

The panic happens here, after the output should have been written to stdout:

    if let Err(e) = result {
        panic!("failed printing to {}: {}", label, e);
    }

Alas, I haven't been able to get the values of label or e. A stacktrace can be reproduced as follows. Note that I manually add the rust source as a directory in gdb.

$ rust-gdb --args bash $ERL_TOP/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs
(gdb) directory /home/evnu/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust
(gdb) b rust_panic
(gdb) run
...
line = 353428
line = 353429[Switching to Thread 0x7fffb4279700 (LWP 28195)]

Thread 7 "3_scheduler" hit Breakpoint 1, rust_panic () at src/libstd/panicking.rs:513
513	        __rust_start_panic(obj as usize)
(gdb) bt
#0  rust_panic () at src/libstd/panicking.rs:513
....
#4  0x00007fff37d1a812 in std::io::stdio::print_to () at src/libstd/io/stdio.rs:792
#5  std::io::stdio::_print () at src/libstd/io/stdio.rs:802
#6  0x00007fff37cb8a9c in <rustler_test::test_panic::too_much_output as rustler::nif::Nif>::RAW_FUNC::nif_func::wrapper::{{closure}}::too_much_output (n=10000000) at rustler_tests/native/rustler_test/src/test_panic.rs:4
...
#10 0x00007fff37ca0ae7 in std::panicking::try (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panicking.rs:265
#11 0x00007fff37c417a4 in std::panic::catch_unwind (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panic.rs:396
#12 0x00007fff37cb891e in <rustler_test::test_panic::too_much_output as rustler::nif::Nif>::RAW_FUNC::nif_func::wrapper (env=..., args=...) at rustler_tests/native/rustler_test/src/test_panic.rs:1
#13 0x00007fff37cb8879 in <rustler_test::test_panic::too_much_output as
....

evnu avatar Jan 21 '20 21:01 evnu

@evnu can you swap the use of println! with env_logger and log crates and see if you can trigger a panic? So far I've had good luck with that combination in https://github.com/scrogson/franz

scrogson avatar Jan 21 '20 21:01 scrogson

I added env_logger to the test with https://github.com/evnu/rustler/commit/ec6427d6ee1bf09263cd0c8bec7ba2e0a9430dd2. I can no longer reproduce the panic with gdb:

RUST_LOG=info rust-gdb --args bash /home/evnu/tools/otp/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs

evnu avatar Jan 21 '20 21:01 evnu

I tried to get some more info using lldb. While I wasn't able to break at rust_panic, I got some more info on the specific error. Maybe this can help to track down the cause.

thread '<unnamed>' panicked at 'failed printing to stdout: Resource temporarily unavailable (os error 11)', src/libstd/io/stdio.rs:792:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

I ran lldb like this:

$ echo '/home/evnu/tools/otp/bin/cerl -kernel shell_history enabled -pa $(find /usr/lib/elixir/lib/*/ebin) -elixir ansi_enabled true -noshell -s elixir start_cli -- -extra /usr/bin/mix test test/panic_on_output_test.exs' > run.sh
$ rust-lldb -- bash run.sh

Error 11 = EAGAIN:

~/tools/rustler(72-panic-output ✗) errno 11
EAGAIN 11 Resource temporarily unavailable

evnu avatar Jan 22 '20 22:01 evnu

Closing, I think we cannot do something here.

evnu avatar Nov 11 '22 11:11 evnu

It looks like setting stdio as nonblocking is generally considered bad practice as it not only propagates to external libraries but also affects every program sharing the tty.

Turns out it surely had caused other complaints upstream for the reason I mentioned: https://github.com/erlang/otp/issues/3150. It looks like a new tty driver implementation is in the pipeline and will hopefully fix this issue in a future release: https://github.com/erlang/otp/commit/bd0865ff86f06bba13a99e38db5d1f0a124e5cb3.

ishitatsuyuki avatar Dec 03 '22 07:12 ishitatsuyuki