Reproducible race-condition(s) in HttpServer::bind/run

Open elliottdehn opened this issue 2 years ago • 1 comment

Expected Behavior

  • Calling HttpServer::bind on an (address, port) that another process is bound to always results in an Err.
  • Calling HttpServer::run on an HttpServer that does (somehow) get bound to such an occupied (address, port) doesn't result in a Future that will, at some point in its execution, be polled via Future::poll. (A minimal sketch of the first expectation follows this list.)
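
For reference, here is a minimal single-process sketch of the first expectation. This is my own illustration, not code from the repository, and since the reported race involves multiple processes, this in-process check may well pass even on an affected machine:

use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() {
  // First bind succeeds and holds its listener(s) open for the whole program.
  let _first = HttpServer::new(|| App::new())
    .bind(("localhost", 8080))
    .expect("first bind should succeed");

  // Expectation under test: a second bind to the now-occupied
  // (address, port) always yields Err (typically AddrInUse), never Ok.
  let second = HttpServer::new(|| App::new()).bind(("localhost", 8080));
  assert!(second.is_err(), "second bind unexpectedly succeeded");
}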

Current Behavior

This is a log history of HttpServer instances on a single machine (an M1 MacBook Pro) having been bound to an (address, port) and run to yield a Future that calls Future::poll (denoted +port), and having been detected as stopped by other processes (denoted -port). It concerns at most N* processes (denoted pid, a dynamic set) and at most N ports ([8080-8090), a static set) that may be bound via HttpServer::bind at any given time; timestamps are denoted seconds.millisecs.microsecs:

*The race-condition results in violations of this invariant: in the sample history, N is 10--the number of physical cores the affected machine possesses.

...
(time: 1669338224.696.276) (pid: 93302) -8080 (nonce: 0)
(time: 1669338224.697.161) (pid: 93281) -8087 (nonce: 0)
(time: 1669338224.697.178) (pid: 93306) -8080 (nonce: 1)
(time: 1669338224.698.629) (pid: 93306) -8087 (nonce: 2)
(time: 1669338224.699.300) (pid: 93306) -8089 (nonce: 3)
(time: 1669338224.703.130) (pid: 93309) +8080 (cause: (pid: 93302, nonce: 0))
(time: 1669338224.703.282) (pid: 93311) +8080 (cause: (pid: 93306, nonce: 1))
...

The above proceeds as expected until the last line (line #125 of this contiguous and comprehensive excerpt), where we observe process 93311 bind-ing and run-ning (to yield a Future that calls Future::poll) an HttpServer on (localhost, 8080) despite process 93309 having done exactly the same thing ~90 microseconds earlier. Perhaps worth mentioning: ~~these processes live for at least 1 second, upon and after which~~ (see comment) each process has a 50% chance to abort itself--every second.

For clarity, this is how the server is bound, run, and (if a suspended run is observed) logged in the log history (via slog):

// (...)
      let srv = HttpServer::new(|| App::new().service(_gen_time));
      let res = srv
        .bind(("localhost", port_try))
        .map(|s| (&mut s.run()).now_or_never());
      match res {
        Ok(ret) => {
          if ret.is_none() {
            // "no return" means "the future got suspended"
            slog(&format!(
              "+{} (cause: (pid: {}, nonce: {}))",
              port_try,
              argv().get(3).unwrap_or(&String::from("-1")),
              argv().get(4).unwrap_or(&String::from("-1")),
            ));
          }
          _append_one(
            "host",
            &LocalHost {
              pid: id(),
              port: port_try,
            },
          );
          Ok(port_try)
        }
        _ => Err(port_try),
      }
 // (...)
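
For readers unfamiliar with now_or_never (from futures-util): it polls a future exactly once and returns Some(output) if the future completed on that poll, or None if it returned Pending--which is what "suspended" means above. A tiny standalone illustration (mine, not from the repository):

use futures_util::FutureExt as _;

fn main() {
  // A future that never completes: the single poll yields Pending, so None.
  assert!(std::future::pending::<()>().now_or_never().is_none());

  // A future that is already complete: the single poll yields Ready, so Some.
  assert_eq!(std::future::ready(42).now_or_never(), Some(42));
}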

Now that I look, the pattern also appears on line #24 of the same history (the last line of the excerpt below):

(time: 1669338220.562.406) (pid: 93222) -8084 (nonce: 0)
(time: 1669338220.562.432) (pid: 93221) -8084 (nonce: 1)
(time: 1669338220.562.432) (pid: 93223) -8084 (nonce: 0)
(time: 1669338220.567.601) (pid: 93221) -8082 (nonce: 2)
(time: 1669338220.567.601) (pid: 93222) -8082 (nonce: 1)
(time: 1669338220.570.526) (pid: 93229) +8084 (cause: (pid: 93222, nonce: 0))
(time: 1669338220.570.579) (pid: 93228) +8084 (cause: (pid: 93223, nonce: 0))

In the above, the "race interval" is just ~50 microseconds! There are sure to be more occurrences in the linked history, but I'm going to spare myself (and readers) an exhaustive search for all of them--I think the issue is now clear enough.

Possible Solution

For the time being, most likely a custom file-lock-oriented workaround (a rough sketch follows). It seems like something that should be solved at a deeper level, though.
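
A rough sketch of what such a workaround could look like, using the third-party fs2 crate for advisory file locks (the crate choice, the helper name, and the lock-file path are my own assumptions; none of this is from the repository): take an exclusive lock on a per-port lock file before calling HttpServer::bind, and keep the file handle alive for as long as the process owns the port.

use std::fs::{File, OpenOptions};
use std::io;

use fs2::FileExt; // advisory file locks (third-party crate)

// Hypothetical helper: returns the lock file on success; keep it alive for
// as long as the process owns the port, and drop it to release the lock.
fn acquire_port_lock(port: u16) -> io::Result<File> {
  let lock = OpenOptions::new()
    .create(true)
    .write(true)
    .open(format!("/tmp/actix-port-{}.lock", port))?;
  lock.try_lock_exclusive()?; // Err if another process already holds the lock
  Ok(lock)
}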

I don't know what the solution is (apart from "use a different framework"), or even whether the problem is owned by Actix itself (perhaps the issue is with Rust, or OSX). Right now I'm "somewhat confident" that there is indeed a race-condition being revealed here (rather than a bug in the affected code), and I feel that others with more knowledge may be able to provide insight and/or triage the issue further than I can at this point.

Steps to Reproduce (for bugs)

(see repository) - or main.rs

Don't panic! It's ~350 lines of code, most of which are not significant. Here's the most interesting part, main:

#[actix_web::main]
async fn main() -> io::Result<()> {
  let argv = env::args().collect::<Vec<String>>();
  let exec = &argv[0];
  let port = argv.get(1).map(|s| s.parse::<Port>().x()).unwrap_or(8080);
  let card = argv // Target cardinality of healthy server distribution.
    .get(2)
    .map(|s| s.parse::<usize>().x())
    .unwrap_or_else(num_cpus::get_physical);

  // Create this server and start listening on nearest available port!
  let port_deploy = if card > 0 {
    gen_deploy((port, None)) // Deploy FROM port (may/will wander).
  } else {
    gen_deploy((port, Some(port))) // Deploy TO port (never wander).
  }
  .await
  .expect("Failed to deploy the binary"); // Many will try; one wins.

  // HAPPENS ONCE: Recursive self-execution to achieve N-cardinality.
  if card > 1 {
    let mut exec_init = Command::new(exec);
    exec_init.arg((port_deploy + 1).to_string()); // Wanders from last.
    exec_init.arg((card as u16 - 1).to_string());
    exec_init.spawn().expect("Failed to spawn the child"); // No block!
  } // When single-servers get re-deployed a cardinality 0 is provided.

  // EVERY SECOND: The server has a 50% chance to abort itself QUIETLY!
  // >> This is just for demonstration and SHOULD NOT be in production.
  let mut rng = rand::thread_rng();
  loop {
    for port_host in get_host_ports() {
      let response =
        reqwest::get(format!("http://localhost:{}/__/time", port_host))
          .await;
      match response {
        Ok(_) => {}
        Err(_) => {
          /* "Oh dear, you are dead!" */
          gen_respawn(port_host).await
        }
      }
    }

    if rng.gen::<f64>() > 0.5 {
      abort();
      /* "Oh dear, I am dead!" */
    } else {
      task::sleep(Duration::from_secs(1)).await;
    }
  }
}

Reproduction is as simple as building and running (cargo run) the above; no arguments are necessary. It will produce the log history seen above in a file called slog; the host file is functional and not of any particular interest here. This code (as long as no arguments are provided) will deploy N HttpServer instances across N processes (modulo this issue), probing from port 8080 upwards, where N is the number of physical cores on the machine. If you'd like to probe from a different port, pass it as the first argument; if you'd like to use a different number of processes, pass it as the second.

If you'd like to deploy to exactly the given port, with no probing or distribution at all, pass 0 as the number of processes. That's the extent of the code's operable behavior. Use chmod 000 host to decisively kill all past and current server processes; this removes all permissions (read/write/anything) on a file that every process needs in order to continue operating. That's the extent of the code's known externalities.

Context

I'm building a web server that supports continuous integration of new binary versions with minimal (or no) loss/degradation in service quality/availability. Since I plan (or anticipate) to deploy services in multiple languages, and not all of those languages have parallelism capabilities (or, at least, exercising them would be non-standard), it makes sense--in order to keep deployments agnostic to the language that produced a binary--to deploy N single-threaded processes across the N physical cores of a machine, and then load-distribute across them, rather than using one process with N worker threads. There should never be more than one process on a network port, and no system should be designed from the outset to anticipate/allow such a thing; the impact of this issue is therefore the apparent non-viability of (what appears to be) a sound system design, at least via the Actix Web framework.
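
For concreteness, each deployed process in this design would look roughly like the following (a sketch of the intent, not code from the repository; the port and the empty App are placeholders). workers(1) pins each process to a single worker thread, so parallelism comes entirely from running N such processes:

use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
  // One worker thread per process; parallelism comes from running N processes.
  HttpServer::new(|| App::new())
    .workers(1)
    .bind(("localhost", 8080))?
    .run()
    .await
}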

Once again: I don't know for a fact that the issue lies with Actix. It may be necessary to replicate the system in a different framework (and a different language if still affected, which would indicate an issue with Rust or OSX). I'll do that, but wanted to report this now so that the author(s) are aware of a stumbling-point that affects this project (and perhaps others, yet to be determined).

Your Environment

Operating System: OSX Monterey 12.2.1
Model Name:	MacBook Pro
Model Identifier:	MacBookPro18,4
Chip:	Apple M1 Max
Total Number of Cores:	10 (8 performance and 2 efficiency)
Memory:	64 GB
System Firmware Version: 7429.81.3
OS Loader Version:	7429.81.3

I can provide more information if doing so would be helpful.

  • Rust Version (i.e., output of rustc -V): rustc 1.67.0-nightly (215e3cd21 2022-11-03)
  • Actix Web Version: 4.2.1

elliottdehn · Nov 25 '22 03:11

More info: I thought I may have identified the issue: processes have an "immediate" (modulo some speed bumps) opportunity to self-abort before entering the 1-second sleep loop (a natural condition that should be anticipated/designed for, and not an issue with Actix). If a server gets bound and then "immediately" self-aborts, another process can appear to be binding to an occupied port when in fact the port has just been freed; that would not be an issue with Actix--it would instead be this design being resilient to an unexpected degree (try saying "deeply-pipelined port binding" three times fast).

Anyway, so... I went and changed this (of main above):

  // EVERY SECOND: The server has a 50% chance to abort itself QUIETLY!
  // >> This is just for demonstration and SHOULD NOT be in production.
  let mut rng = rand::thread_rng();
  loop {
    for port_host in get_host_ports() {
      let response =
        reqwest::get(format!("http://localhost:{}/__/time", port_host))
          .await;
      match response {
        Ok(_) => {}
        Err(_) => {
          /* "Oh dear, you are dead!" */
          gen_respawn(port_host).await
        }
      }
    }

    if rng.gen::<f64>() > 0.5 {
      abort();
      /* "Oh dear, I am dead!" */
    } else {
      task::sleep(Duration::from_secs(1)).await; // <== SLEEP IN QUESTION
    }
  }

...to this:

  // EVERY SECOND: The server has a 50% chance to abort itself QUIETLY!
  // >> This is just for demonstration and SHOULD NOT be in production.
  let mut rng = rand::thread_rng();
  loop {
    task::sleep(Duration::from_secs(1)).await; // <== SLEEP IN QUESTION
    for port_host in get_host_ports() {
      let response =
        reqwest::get(format!("http://localhost:{}/__/time", port_host))
          .await;
      match response {
        Ok(_) => {}
        Err(_) => {
          /* "Oh dear, you are dead!" */
          gen_respawn(port_host).await
        }
      }
    }

    if rng.gen::<f64>() > 0.5 {
      abort();
      /* "Oh dear, I am dead!" */
    }
  }

(Note for clarity: the sleep was moved to the loop's head instead of its tail, so that it precedes the abort coin-flip.)

That makes it so a process that binds an HttpServer in this context must live for at least one full second before having the opportunity to abort itself. However, the issue is still observable even after this change:

(...)
(time: 1669348750.294.744) (pid: 47283) -8081 (nonce: 0)
(time: 1669348750.296.913) (pid: 47282) -8081 (nonce: 0)
(time: 1669348750.299.172) (pid: 47281) -8089 (nonce: 0)
(time: 1669348750.300.666) (pid: 47283) -8089 (nonce: 1)
(time: 1669348750.300.660) (pid: 47282) -8089 (nonce: 1)
(time: 1669348750.305.794) (pid: 47292) -8081 (nonce: 0)
(time: 1669348750.309.777) (pid: 47268) -8089 (nonce: 5)
(time: 1669348750.309.813) (pid: 47291) -8089 (nonce: 0)
(time: 1669348750.311.148) (pid: 47297) +8081 (cause: (pid: 47282, nonce: 0))
(time: 1669348750.311.234) (pid: 47296) +8081 (cause: (pid: 47283, nonce: 0))
(...)

...this was a surprising negative result!

elliottdehn · Nov 25 '22 04:11