Critical: WASM contract execution blocks event loop for 30-90 seconds causing gateway failures
Problem Summary
The gateway event loop is being blocked for 30-90 seconds at a time, which causes channel overflow, dropped packets, and connection failures. This is affecting production gateways and preventing normal operation.
Root Cause
The WASM contract execution in crates/core/src/wasm_runtime/contract.rs blocks the tokio runtime with thread::sleep():
fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    for _ in 0..5 {
        if r.is_finished() {
            break;
        }
        thread::sleep(Duration::from_secs(1)); // <-- Blocks tokio runtime!
    }
    if !r.is_finished() {
        return Err(Errors::MaxComputeTimeExceeded);
    }
    let (r, s) = r
        .join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}
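The sleep is harmful because thread::sleep parks the OS thread itself: when this function runs on a tokio worker thread, that worker cannot poll any other task (including the event loop) until the sleep returns, so every second spent waiting is a second in which no packets are processed.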
Impact
- Event loop blocked: No packets can be processed while waiting for contract execution
- Channel overflow: 100-packet buffers fill up during the blocking period
- Connection timeouts: Keep-alive mechanism fails, connections drop after 30 seconds
- Cascade failures: Dropped packets lead to connection reestablishment attempts
Evidence
Instrumentation on the vega gateway shows a regular blocking pattern:
[2025-06-16T00:05:28] WARN Long gap between peer connection polls, gap_ms: 30239, queue_size: 2
[2025-06-16T00:06:33] WARN Long gap between peer connection polls, gap_ms: 54937, queue_size: 2
[2025-06-16T00:12:28] WARN Long gap between peer connection polls, gap_ms: 90111, queue_size: 2
[2025-06-16T00:14:33] WARN Long gap between peer connection polls, gap_ms: 95239, queue_size: 2
Why This Happens
- Contract execution spawns a separate thread (likely due to wasmer threading constraints)
- The main tokio runtime blocks with thread::sleep(Duration::from_secs(1)) while waiting
- Even if the contract finishes in 10ms, it still blocks for the full 1 second
- Multiple contract executions can compound to 30-90 second blocks
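The observed gaps are consistent with this padding compounding: with up to five 1-second polls per execution, a single contract call can stall the loop for as long as 5 seconds, so a burst of only 6-18 back-to-back executions is enough to account for the 30-90 second gaps shown above.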
Questions
- What are the specific threading constraints that require spawning a separate thread?
- Is the wasmer Store not Send/Sync?
- Can we use tokio::task::spawn_blocking instead?
- Is the 1-second sleep interval necessary, or can it be reduced?
Potential Solutions
Quick Fix (minimal risk)
- Reduce sleep from 1 second to 10-50ms
- Use exponential backoff instead of fixed interval
Proper Fix (needs understanding of constraints)
- Use tokio::task::spawn_blocking for contract execution
- Eliminate the polling loop entirely
- Use channels or async primitives for completion notification (a sketch of this approach follows below)
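For the channel route, a minimal sketch of the idea (the function name, generic shape, and 5-second budget are illustrative, not existing freenet-core APIs) would hand the result back through a tokio oneshot channel so the async side awaits with a timeout instead of sleeping:

use std::thread;
use std::time::Duration;
use tokio::sync::oneshot;

// Run a blocking job on its own thread and await completion without parking
// a tokio worker. `work` stands in for the wasmer call plus Store hand-back.
async fn run_with_timeout<T: Send + 'static>(
    work: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = oneshot::channel();
    thread::spawn(move || {
        // If the receiver was dropped (e.g. after a timeout), ignore the send error.
        let _ = tx.send(work());
    });
    match tokio::time::timeout(Duration::from_secs(5), rx).await {
        Ok(Ok(value)) => Some(value), // finished in time
        _ => None,                    // timed out or the worker panicked
    }
}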
Related Issues
- #1658 - Keep-alive timer improvements
- #1665 - Connection timeout after inactivity
cc @nachodg - I understand you were working on threading issues with the wasmer runtime. Any context on the constraints would be very helpful.
Investigation Update
I've tracked down why this was implemented this way:
Historical Context
From commit 5c6ffda3 (PR #1401 "fix: don't block on contracts running for too long"):
- Original problem: Contracts were blocking the thread using thread::scope
- The fix: Changed to thread::spawn to avoid blocking
- Unintended consequence: The 1-second polling sleep is actually worse than the original blocking!
Why Threading is Required
The wasmer Store has ownership constraints:
- Requires exclusive &mut Store access
- Cannot be shared across threads or held across await points
- This prevents using normal async patterns like tokio::task::spawn_blocking without careful ownership transfer
The Irony
The fix to prevent blocking made it worse:
- Before: Blocked for actual contract execution time (could be 10ms)
- After: Blocks for at least 1 second regardless of execution time
- Result: Fast contracts that complete in 10ms still block for 1000ms
Proposed Solutions
1. Immediate Fix (Low Risk)
Change the polling interval from 1 second to 10ms with exponential backoff:
fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    let mut wait_ms = 10;
    let mut total_waited = 0;
    loop {
        if r.is_finished() {
            break;
        }
        if total_waited >= 5000 {
            // 5 second timeout
            return Err(Errors::MaxComputeTimeExceeded);
        }
        thread::sleep(Duration::from_millis(wait_ms));
        total_waited += wait_ms;
        wait_ms = (wait_ms * 2).min(100); // Exponential backoff, cap at 100ms
    }
    let (r, s) = r
        .join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}
For typical fast contracts this reduces the blocking from a fixed 1-5 seconds to roughly 10-100ms of detection latency.
2. Proper Fix with spawn_blocking (Medium Risk)
Transfer Store ownership to a blocking task:
// Take exclusive ownership of the Store out of the runtime...
let mut store = self.wasm_store.take().unwrap();
let (result, store) = tokio::task::spawn_blocking(move || {
    // ...run the contract on tokio's blocking thread pool...
    let result = validate_func.call(&mut store, param_buf_ptr, state_buf_ptr, related_buf_ptr);
    (result, store)
})
.await
.map_err(|e| Errors::Other(e.into()))?;
// ...and put the Store back once execution completes.
self.wasm_store = Some(store);
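spawn_blocking runs the closure on tokio's dedicated blocking thread pool, so the event loop keeps being polled while the contract executes; the Store's ownership is moved into the closure and handed back through the task's return value, which satisfies the exclusive &mut Store requirement without holding a borrow across an await point.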
3. Robust Long-term Solution (Requires Design)
- Dedicated WASM executor service with its own thread pool
- Message-passing architecture for contract execution
- Proper cancellation and timeout handling
- Performance metrics and monitoring
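As a rough illustration of that direction (all names here - ContractJob, execute_contract, the queue depth - are hypothetical, not existing freenet-core APIs), a dedicated executor thread could own the Store for its whole lifetime and serve jobs submitted over a bounded channel:

use tokio::sync::{mpsc, oneshot};

// A job submitted to the executor thread, with a oneshot channel for the reply.
struct ContractJob {
    payload: Vec<u8>,
    reply: oneshot::Sender<Result<i64, String>>,
}

fn spawn_executor() -> mpsc::Sender<ContractJob> {
    // A bounded queue gives callers natural backpressure when the executor is busy.
    let (tx, mut rx) = mpsc::channel::<ContractJob>(64);
    std::thread::spawn(move || {
        // The wasmer Store would be created and owned here, never crossing threads.
        while let Some(job) = rx.blocking_recv() {
            let result = execute_contract(&job.payload); // placeholder for the real wasmer call
            let _ = job.reply.send(result);
        }
    });
    tx
}

// Stand-in for running a contract against the executor-owned Store.
fn execute_contract(_payload: &[u8]) -> Result<i64, String> {
    Ok(0)
}

A caller would then send a ContractJob and await the oneshot receiver, with tokio::time::timeout providing the per-contract compute budget and cancellation point.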
Next Steps
- I'll implement the immediate fix (10ms polling) for testing
- This should dramatically reduce the blocking while maintaining safety
- We can then design a proper async solution that respects wasmer's constraints
The goal is a robust and efficient solution that:
- Never blocks the tokio runtime
- Handles contract timeouts gracefully
- Maintains high throughput for contract execution
- Provides proper backpressure when overloaded
@sanity isn't this in particular fixed now?
This should be fixed for good now with the new executor pool. Need to add better integration tests though.