[Bug]: Using `futures_executor::block_on` blocks forever in WASM
What happened?
Using this library in WASM causes the program to block forever and use 100% CPU, due to the use of `futures_executor::block_on`.
API Version
0.24.0
SDK Version
0.24.1
What Exporter(s) are you seeing the problem on?
N/A
Using futures_executor::block_on is harmful to performance and correctness.
In single-threaded contexts, blocking in place blocks the entire runtime forever: `block_on` makes the thread spin until the future resolves, pinning a CPU core at 100%.
In a multithreaded environment it is also harmful: it blocks an entire thread, so any other tasks scheduled on that thread are starved of resources. Worse still, if the future being blocked on is itself waiting for a task scheduled on that same thread (for example via the tokio LIFO slot), the result is a deadlock.
See this draft PR as a first attempt to address this problem.
In order to make this work for WASM I introduced a spawn_future function, which does the right thing in WASM but continues using the bad futures_executor::block_on function in other environments. With this patch I stopped having problems in the WASM project I'm working on, thus proving that this was the root of the issue.
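The `spawn_future` helper might look roughly like this. This is a hypothetical sketch, not the actual PR code: the non-WASM branch uses a hand-rolled poll loop as a std-only stand-in for `futures_executor::block_on`, and the WASM branch assumes the `wasm-bindgen-futures` crate.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing; just enough to drive a future by polling in a loop.
#[cfg(not(target_arch = "wasm32"))]
fn noop_waker() -> Waker {
    unsafe fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

#[cfg(not(target_arch = "wasm32"))]
fn spawn_future<F: Future<Output = ()>>(future: F) {
    // Outside WASM, keep the old behavior for now: block the current thread
    // until the future resolves (a std-only stand-in for
    // futures_executor::block_on, illustrating the busy-loop problem).
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    while future.as_mut().poll(&mut cx).is_pending() {
        std::thread::yield_now();
    }
}

#[cfg(target_arch = "wasm32")]
fn spawn_future<F: Future<Output = ()> + 'static>(future: F) {
    // On wasm32 there is no thread we can afford to block: hand the future
    // to the JS event loop instead (assumes the wasm-bindgen-futures crate).
    wasm_bindgen_futures::spawn_local(future);
}
```

The key point is that callers never block on WASM; the future is queued on the host event loop and completes asynchronously.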
Next I made some more methods async. The trait methods I made async are:
- `LogProcessor::emit`
- `LogProcessor::force_flush`
- `LogProcessor::shutdown`
- `SpanProcessor::on_end`
- `SpanProcessor::force_flush`
- `SpanProcessor::shutdown`
This avoids having to call `spawn_future` inside these functions. However, I couldn't avoid calling it inside the `Drop` implementations that need to call `SpanProcessor::shutdown`, as well as inside `periodic_reader.rs`, because I was unsure how best to proceed. So there are still some open questions.
Thanks for opening the issue with the detailed analysis and the PR! I left a comment on the PR: https://github.com/open-telemetry/opentelemetry-rust/pull/2048/files#r1729165305.
We need to figure out the right way to execute all the background tasks (batch exporting, metric reader). @ThomsonTan was also looking into this. Tagging @TommyCpp and @lalitb as well to share ideas/concerns.
> In a multithreaded environment it is also harmful as it blocks an entire thread, and if that thread happens to have other tasks scheduled on it, this will cause those tasks to be starved of resources, worse still if the future the thread is blocking on is waiting for a task that's scheduled on that same thread via a mechanism such as the tokio LIFO slot, leading to a deadlock.
We are exploring a design where OTel will have its own dedicated thread, which avoids these issues. Do you consider it a problem if OTel makes a blocking call on its own background thread? That thread is OTel's own and won't have any other tasks scheduled on it.
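The dedicated-thread design could be sketched roughly like this. This is a minimal std-only illustration of the idea, not the actual implementation: `BackgroundWorker`, `Message`, and the channel layout are all hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Messages sent from user threads to the dedicated background thread.
enum Message {
    Export(String),             // stand-in for a batch of spans/metrics
    Shutdown(mpsc::Sender<()>), // ack channel so shutdown can wait for drain
}

struct BackgroundWorker {
    tx: mpsc::Sender<Message>,
    handle: Option<thread::JoinHandle<()>>,
}

impl BackgroundWorker {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel();
        let handle = thread::spawn(move || {
            for msg in rx {
                match msg {
                    // Blocking here is fine: this thread belongs to the SDK
                    // and runs nothing else, so no user task can be starved.
                    Message::Export(batch) => {
                        let _ = batch; // ... perform the blocking export ...
                    }
                    Message::Shutdown(ack) => {
                        let _ = ack.send(());
                        break;
                    }
                }
            }
        });
        Self { tx, handle: Some(handle) }
    }

    // Called from user threads; only enqueues, never blocks the caller.
    fn export(&self, batch: String) {
        let _ = self.tx.send(Message::Export(batch));
    }

    // Waits for the worker to acknowledge and exit.
    fn shutdown(&mut self) {
        let (ack_tx, ack_rx) = mpsc::channel();
        if self.tx.send(Message::Shutdown(ack_tx)).is_ok() {
            let _ = ack_rx.recv();
        }
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }
}
```

User threads only ever call `send`, so any `block_on` (or other blocking I/O) is confined to the SDK-owned thread.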
I keep hitting the same issues on regular x86_64 Linux builds when using Tokio. I'm not entirely sure what's causing this, but all `runtime.block_on` calls hang. The presence of blocking calls makes most of the API brittle.
Yes, `block_on` is problematic also in async contexts run by Tokio. This minimized test blocks:

```rust
use opentelemetry_sdk::{runtime, testing::trace::NoopSpanExporter, trace::TracerProvider};

#[tokio::test]
async fn shutdown_in_async_scope() {
    let provider = TracerProvider::builder()
        .with_batch_exporter(NoopSpanExporter::new(), runtime::Tokio)
        .build();
    provider.shutdown().unwrap();
}
```
One workaround is to call shutdown in a worker thread:

```rust
#[tokio::test]
async fn shutdown_in_tokio_worker() {
    let provider = TracerProvider::builder()
        .with_batch_exporter(NoopSpanExporter::new(), runtime::Tokio)
        .build();
    tokio::task::spawn_blocking(move || provider.shutdown())
        .await
        .unwrap()
        .unwrap();
}
```
@mzabaluev @demurgos https://github.com/open-telemetry/opentelemetry-rust/pull/2403 switched the default for Metrics' `PeriodicReader` to use a dedicated thread. Yes, it does call `block_on`, but only on its own thread, not on a user thread, so this should not be a concern.
I'll open a separate issue to track making shutdown async as well.
BatchProcessors and PeriodicReader have been modified to use their own dedicated thread and make blocking calls from that thread. I'll close this issue and open a separate one to track WASM support. (There are other WASM-related issues reported as well.)
If there are pending questions on this, please feel free to reopen or open a new one.