
Critical: WASM contract execution blocks event loop for 30-90 seconds causing gateway failures

Open · sanity opened this issue 5 months ago

Problem Summary

The gateway event loop is blocked for 30-90 seconds at a time, causing channel overflow, dropped packets, and connection failures. This is affecting production gateways and preventing normal operation.

Root Cause

The WASM contract execution in crates/core/src/wasm_runtime/contract.rs blocks the tokio runtime with thread::sleep():

fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    for _ in 0..5 {
        if r.is_finished() {
            break;
        }
        thread::sleep(Duration::from_secs(1));  // <-- Blocks tokio runtime!
    }
    if !r.is_finished() {
        return Err(Errors::MaxComputeTimeExceeded);
    }
    let (r, s) = r
        .join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}

Impact

  • Event loop blocked: No packets can be processed while waiting for contract execution
  • Channel overflow: 100-packet buffers fill up during the blocking period
  • Connection timeouts: Keep-alive mechanism fails, connections drop after 30 seconds
  • Cascade failures: Dropped packets lead to connection reestablishment attempts

Evidence

Instrumentation on vega gateway shows regular blocking patterns:

[2025-06-16T00:05:28] WARN Long gap between peer connection polls, gap_ms: 30239, queue_size: 2
[2025-06-16T00:06:33] WARN Long gap between peer connection polls, gap_ms: 54937, queue_size: 2
[2025-06-16T00:12:28] WARN Long gap between peer connection polls, gap_ms: 90111, queue_size: 2
[2025-06-16T00:14:33] WARN Long gap between peer connection polls, gap_ms: 95239, queue_size: 2

Why This Happens

  1. Contract execution spawns a separate thread (likely due to wasmer threading constraints)
  2. The calling task blocks the tokio runtime with thread::sleep(Duration::from_secs(1)) while waiting
  3. Even if the contract finishes in 10ms, the loop still blocks for the full 1-second sleep
  4. Multiple contract executions compound: for example, a few dozen queued calls, each forced to sleep for 1-5 seconds, add up to the 30-90 second gaps seen in the logs above

Questions

  1. What are the specific threading constraints that require spawning a separate thread?
  2. Is the wasmer Store not Send/Sync?
  3. Can we use tokio::task::spawn_blocking instead?
  4. Is the 1-second sleep interval necessary or can it be reduced?

Potential Solutions

Quick Fix (minimal risk)

  • Reduce sleep from 1 second to 10-50ms
  • Use exponential backoff instead of fixed interval

Proper Fix (needs understanding of constraints)

  • Use tokio::task::spawn_blocking for contract execution
  • Eliminate the polling loop entirely
  • Use channels or async primitives for completion notification (see the sketch below)
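
A minimal sketch of the completion-notification idea, under assumptions: run_contract is a hypothetical closure standing in for the real contract entry point, and the wasmer Store is moved onto the worker thread and handed back on completion (the existing JoinHandle signature already implies Store is Send). The contract runs on its own OS thread and signals completion over a tokio::sync::oneshot channel, so the event loop awaits instead of sleeping:

use std::time::Duration;
use tokio::sync::oneshot;

// Hedged sketch, not the current freenet-core API: `run_contract` is a hypothetical
// parameter standing in for whatever closure actually drives the contract call.
async fn execute_contract(
    store: wasmer::Store,
    run_contract: impl FnOnce(&mut wasmer::Store) -> Result<i64, wasmer::RuntimeError> + Send + 'static,
) -> anyhow::Result<(i64, wasmer::Store)> {
    let (tx, rx) = oneshot::channel();
    std::thread::spawn(move || {
        let mut store = store;
        let result = run_contract(&mut store);
        // The receiver may already have timed out and dropped; ignore that error.
        let _ = tx.send((result, store));
    });
    // Await completion instead of sleeping, so the event loop keeps processing packets.
    match tokio::time::timeout(Duration::from_secs(5), rx).await {
        Ok(Ok((result, store))) => Ok((result?, store)),
        Ok(Err(_)) => anyhow::bail!("contract thread dropped without replying"),
        Err(_) => anyhow::bail!("max compute time exceeded"),
    }
}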

Related Issues

  • #1658 - Keep-alive timer improvements
  • #1665 - Connection timeout after inactivity

cc @nachodg - I understand you were working on threading issues with the wasmer runtime. Any context on the constraints would be very helpful.

sanity · Jun 16 '25 00:06

Investigation Update

I've found the root cause of why this was implemented this way:

Historical Context

From commit 5c6ffda3 (PR #1401 "fix: don't block on contracts running for too long"):

  • Original problem: Contracts were blocking the thread using thread::scope
  • The fix: Changed to thread::spawn to avoid blocking
  • Unintended consequence: The 1-second polling sleep is actually worse than the original blocking!

Why Threading is Required

The wasmer Store has ownership constraints:

  • Requires exclusive &mut Store access
  • Cannot be shared across threads or held across await points
  • This prevents using normal async patterns like tokio::task::spawn_blocking without careful ownership transfer

The Irony

The fix to prevent blocking made it worse:

  • Before: Blocked for actual contract execution time (could be 10ms)
  • After: Blocks for at least 1 second regardless of execution time
  • Result: Fast contracts that complete in 10ms still block for 1000ms

Proposed Solutions

1. Immediate Fix (Low Risk)

Change the polling interval from 1 second to 10ms with exponential backoff:

fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    let mut wait_ms = 10;
    let mut total_waited = 0;
    
    loop {
        if r.is_finished() {
            break;
        }
        
        if total_waited >= 5000 {  // 5 second timeout
            return Err(Errors::MaxComputeTimeExceeded);
        }
        
        thread::sleep(Duration::from_millis(wait_ms));
        total_waited += wait_ms;
        wait_ms = (wait_ms * 2).min(100); // Exponential backoff, cap at 100ms
    }
    
    let (r, s) = r.join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}

This reduces blocking from 1-5 seconds to 10-100ms for typical contracts.

2. Proper Fix with spawn_blocking (Medium Risk)

Transfer Store ownership to a blocking task:

let mut store = self.wasm_store.take().unwrap();
let (result, store) = tokio::task::spawn_blocking(move || {
    let result = validate_func.call(&mut store, param_buf_ptr, state_buf_ptr, related_buf_ptr);
    (result, store)
}).await.map_err(|e| Errors::Other(e.into()))?;
self.wasm_store = Some(store);
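
A note on this approach (my reading, not verified against the current code): spawn_blocking moves the closure onto tokio's dedicated blocking thread pool, so the event loop keeps polling connections while the contract runs, and wrapping the awaited handle in tokio::time::timeout would restore the compute-time limit. One caveat is that if the await is cancelled or times out, the Store that was moved into the closure is not returned, so the timeout path needs a way to recover or rebuild the store.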

3. Robust Long-term Solution (Requires Design)

  • Dedicated WASM executor service with its own thread pool (see the sketch after this list)
  • Message-passing architecture for contract execution
  • Proper cancellation and timeout handling
  • Performance metrics and monitoring
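
A rough sketch of what such an executor service could look like, under stated assumptions: call_contract is a hypothetical stand-in for the real contract entry point, the job payload is a plain byte vector, and each worker thread owns its own wasmer Store. Jobs arrive over a channel and results come back on a tokio::sync::oneshot, so the async side only awaits (with a timeout) and the tokio runtime is never blocked. This is a sketch only, not the executor pool referenced in the later comments:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;
use tokio::sync::oneshot;

// Hypothetical stand-in for the real contract call in
// crates/core/src/wasm_runtime/contract.rs; only here to keep the sketch self-contained.
fn call_contract(store: &mut wasmer::Store, _params: &[u8]) -> Result<i64, wasmer::RuntimeError> {
    let _ = store;
    Ok(0)
}

struct Job {
    params: Vec<u8>,                             // hypothetical call payload
    reply: oneshot::Sender<Result<i64, String>>, // completion notification back to async code
}

fn spawn_executor_pool(workers: usize) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..workers {
        let rx = Arc::clone(&rx);
        thread::spawn(move || {
            // Each worker owns its own Store (default engine, for illustration),
            // so nothing is shared across threads.
            let mut store = wasmer::Store::default();
            loop {
                // The lock is held while waiting for the next job, but released
                // before the contract actually runs.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // all senders dropped: shut this worker down
                };
                let result = call_contract(&mut store, &job.params).map_err(|e| e.to_string());
                let _ = job.reply.send(result); // caller may have timed out; ignore
            }
        });
    }
    tx
}

// Async side: submit a job and await the reply; the tokio runtime is never blocked
// and the timeout gives graceful handling of runaway contracts.
async fn execute(pool: mpsc::Sender<Job>, params: Vec<u8>) -> Result<i64, String> {
    let (reply, rx) = oneshot::channel();
    pool.send(Job { params, reply })
        .map_err(|_| "executor pool shut down".to_string())?;
    tokio::time::timeout(Duration::from_secs(5), rx)
        .await
        .map_err(|_| "max compute time exceeded".to_string())?
        .map_err(|_| "worker dropped the job".to_string())?
}

A bounded job queue (for example, a fixed-capacity tokio::sync::mpsc channel in place of the unbounded std channel) would provide the backpressure mentioned above.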

Next Steps

  1. I'll implement the immediate fix (10ms polling) for testing
  2. This should dramatically reduce the blocking while maintaining safety
  3. We can then design a proper async solution that respects wasmer's constraints

The goal is a robust and efficient solution that:

  • Never blocks the tokio runtime
  • Handles contract timeouts gracefully
  • Maintains high throughput for contract execution
  • Provides proper backpressure when overloaded

sanity · Jun 16 '25 00:06

@sanity isn't this in particular fixed now?

iduartgomez · Jul 14 '25 06:07

This should be fixed for good now with the new executor pool. Need to add better integration tests though.

iduartgomez · Jul 22 '25 12:07