
ord server not distributing work to threads correctly

Open gmart7t2 opened this issue 1 year ago • 8 comments

I've noticed that the ord web server doesn't distribute work very evenly between CPU cores. When I make a lot of requests to it at the same time, only one of the requests gets processed immediately and the rest have to wait until the first request is finished.

To demonstrate the effect, I made a bare-bones version of the ord server:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::runtime::Runtime,

  anyhow::Error,
  std::io,
  tokio::task,
};

type Result<T = (), E = Error> = std::result::Result<T, E>;

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      thread::sleep(Duration::from_millis(1000)); "Hello, world!\n"
    }));

//    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
//    eprintln!("Listening on {}://{addr}", "http");
//    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();

    spawn(router).unwrap().await.unwrap().unwrap();
  });
}

fn spawn(router: Router) -> Result<task::JoinHandle<io::Result<()>>> {
  let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
  eprintln!("Listening on {}://{addr}", "http");
  Ok(tokio::spawn(async move { axum_server::Server::bind(addr).serve(router.into_make_service()).await }))
}

Run it, and hit / 4 times very quickly. The first request will come back after 1 second but the other 3 usually take 2 seconds.

Then modify the code to the following and try again:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::runtime::Runtime,

  // anyhow::Error,
  // std::io,
  // tokio::task,
};

// type Result<T = (), E = Error> = std::result::Result<T, E>;

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      thread::sleep(Duration::from_millis(1000)); "Hello, world!\n"
    }));

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    eprintln!("Listening on {}://{addr}", "http");
    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();

//    spawn(router).unwrap().await.unwrap().unwrap();
  });
}

// fn spawn(router: Router) -> Result<task::JoinHandle<io::Result<()>>> {
//   let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
//   eprintln!("Listening on {}://{addr}", "http");
//   Ok(tokio::spawn(async move { axum_server::Server::bind(addr).serve(router.into_make_service()).await }))
// }

Now all 4 requests return after 1 second.

It appears to be the tokio::spawn() call that is messing things up, but I have no idea why.

gmart7t2 avatar Jul 20 '23 01:07 gmart7t2

This is what I see from each piece of code.

The first one:

$ for i in {1..4}; do time wget -O/dev/null -o/dev/null http://localhost:3000/ & done
[1] 4006905
[2] 4006906
[3] 4006908
[4] 4006910

real	0m1.009s
real	0m2.008s
real	0m2.009s
real	0m2.008s

The second one:

$ for i in {1..4}; do time wget -O/dev/null -o/dev/null http://localhost:3000/ & done
[1] 4007012
[2] 4007013
[3] 4007015
[4] 4007017

real	0m1.009s
real	0m1.009s
real	0m1.009s
real	0m1.009s

gmart7t2 avatar Jul 20 '23 01:07 gmart7t2

Interesting. I dug a little into how concurrency is handled by Tokio and by the Rust standard library's thread::sleep function. thread::sleep pauses the entire operating system thread for the specified duration, so calling it inside a Tokio async block blocks the Tokio executor thread the task is running on. This is why you're seeing requests handled sequentially: the first request to arrive puts that executor thread to sleep for 1 second, during which time it can't process any other tasks.
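For comparison, here's a minimal sketch (not the ord code, just the bare-bones server from above) with the blocking sleep swapped for Tokio's async sleep. tokio::time::sleep suspends only the task and hands the thread back to the executor, so concurrent requests shouldn't queue behind each other:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, time::Duration},
  tokio::runtime::Runtime,
};

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      // tokio::time::sleep suspends only this task, not the worker thread,
      // so the executor is free to poll other requests while we wait.
      tokio::time::sleep(Duration::from_millis(1000)).await;
      "Hello, world!\n"
    }));

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    eprintln!("Listening on {}://{addr}", "http");
    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();
  });
}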

The reason the code seems to work better without tokio::spawn is that you're not actually running the server in a separate task: the server runs directly on the main Tokio runtime thread, and each request is handled in a new task. Because these tasks are not blocking the runtime thread, multiple requests can be processed concurrently.

So I assume the issue in the actual ord server is that we have some OS calls that block the Tokio executor thread and we'd have to make them async... Need to look into that more deeply.
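If some of those calls can't easily be made async, a possible alternative (again just a sketch against the bare-bones server, not what ord currently does) would be to push the blocking work onto Tokio's blocking thread pool with task::spawn_blocking, so it doesn't tie up an executor worker:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::{runtime::Runtime, task},
};

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      // spawn_blocking moves the blocking call (stood in for here by
      // thread::sleep) onto Tokio's dedicated blocking thread pool, so it
      // doesn't occupy an async worker thread.
      task::spawn_blocking(|| {
        thread::sleep(Duration::from_millis(1000));
        "Hello, world!\n"
      })
      .await
      .unwrap()
    }));

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    eprintln!("Listening on {}://{addr}", "http");
    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();
  });
}

The trade-off is an extra hop to the blocking pool per request, but the async workers stay free to accept new connections.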

veryordinally avatar Jul 20 '23 08:07 veryordinally

Thanks for looking into it.

This is why you're seeing requests handled sequentially: the first request to arrive puts that executor thread to sleep for 1 second, during which time it can't process any other tasks

If I hit my sleep endpoint 20 times very quickly, I get 1 response after 1 second, then 8 responses a second later, then 8 more another second later, and the last one a second later. I have 8 cpu cores.

It appears that it is able to process 8 requests concurrently but only if they were already waiting to be processed when the current request finishes.

Does that match your understanding? Because it doesn't sound like it matches the line I quoted above.

(In the timing output of the first code snippet (https://github.com/ordinals/ord/issues/2292#issuecomment-1642972258) I get one 1s timing and three 2s timings, so the last 3 requests are processed concurrently.)
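A way to test that hypothesis (just a sketch, with the worker count picked arbitrarily) would be to pin the runtime's worker thread count in the bare-bones server and see whether the size of each one-second batch follows that number instead of the core count:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::runtime::Builder,
};

pub fn main() {
  // Pin the runtime to an explicit number of worker threads (it normally
  // defaults to the number of CPU cores). With blocking handlers, at most
  // this many requests should complete per one-second "round".
  Builder::new_multi_thread()
    .worker_threads(2)
    .enable_all()
    .build()
    .unwrap()
    .block_on(async {
      let router = Router::new().route("/", get(|| async {
        thread::sleep(Duration::from_millis(1000));
        "Hello, world!\n"
      }));

      let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
      eprintln!("Listening on {}://{addr}", "http");
      axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();
    });
}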

gmart7t2 avatar Jul 20 '23 12:07 gmart7t2

I'm experiencing a similar issue. I have a Node.js server sending batches of requests to my ord server. The requests are sent in a non-blocking way, but they are responded to sequentially, each taking a progressively larger amount of time, e.g. req 1: 10ms, req 2: 20ms, ... req 150: 1500ms. I don't have the best metrics to provide you, but I've tested my code extensively and have found that this behavior only shows up when making requests to ord.

Also, I am using the most recent version of the ord client (on GH) as of today, with an --index-sats and --index-runes index file. This issue has coincided with pulling recent commits within the last few days.

To help you reproduce the issue, just try requesting the content of each of the first 100 inscriptions in block 834061.

abelmcelroy avatar Mar 15 '24 17:03 abelmcelroy

Are you able to bisect it down to a specific commit that caused this regression?

raphjaph avatar Mar 17 '24 22:03 raphjaph

Are you able to bisect it down to a specific commit that caused this regression?

Unfortunately not. My FT job is pretty demanding right now. Also, the commits I pulled required me to update my index version and re-index everything, so somewhere in that span of commits is a paradigm shift I'd really prefer not to cross over again (the v18 index took a lot longer to build than my previous one). Sorry I can't be of more help.

abelmcelroy avatar Mar 18 '24 16:03 abelmcelroy

No worries! Would you be willing to share the Node.js server code or benchmarking script you used? Then I could use that as a starting point for debugging.

raphjaph avatar Mar 19 '24 01:03 raphjaph

No worries! Would you be willing to share the Node.js server code or benchmarking script you used? Then I could use that as a starting point for debugging.

Gladly! Hope this helps: I chose a block number where the issue is especially pronounced. If page 1 doesn't demonstrate it, I recommend temporarily changing your page size to 1000 instead of 100.

const axios = require('axios');
const ordExplorerUrl = "<FILL ME IN>"

class OrdExplorer {
  async get(path) {
    console.log(`${path} started`);
    console.time(`${path} took`);
    const { data } = await axios.get(`http://${ordExplorerUrl}:81/${path}`, {
      headers: {
        "Accept": "application/json"
      }
    });
    console.timeEnd(`${path} took`);
    return data
  }
}

const ordExplorer = new OrdExplorer();
const currentBlock = 825951;

// Small helper used below to back off when the server is overloaded.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function main() {

  console.log("searching block", currentBlock)
  try {
    let page = await ordExplorer.get(`inscriptions/block/${currentBlock}/1`);

    let ids = page.ids.filter(id => !!id);

    await Promise.all(ids.map(async (inscriptionId) => {
      let content = null;
      let errors = 0;
      while (content === null) {
        try {
          content = await ordExplorer.get(`content/${inscriptionId}`);
        } catch (err) {
          if (err.message.includes("connect EADDRNOTAVAIL")) {
            console.log("server overloaded, waiting...")
            await wait(5000);
            errors++;
          } else {
            console.log(`CONTENT FAILED ${inscriptionId}: ${err.message}`);
            return
          }
        }
      }
    }));
  } catch (error) {
    console.log(error);
    process.exit(1);
  }
  process.exit(0)
}

main();

Hopefully this recreates the same behavior for you. You'll notice some slow requests block all other requests while they're being handled. The requests are sent in a non-blocking way (which you can also see in the logs) but appear to be responded to sequentially.

Please let me know if I'm misunderstanding the behavior or if you have any suggestions. Thank you!

abelmcelroy avatar Mar 21 '24 16:03 abelmcelroy