
Critical: WASM contract execution blocks event loop for 30-90 seconds causing gateway failures

Open · sanity opened this issue 5 months ago

Problem Summary

The gateway event loop is blocked for 30-90 seconds at a time, causing channel overflow, dropped packets, and connection failures. This is affecting production gateways and preventing normal operation.

Root Cause

The WASM contract execution in crates/core/src/wasm_runtime/contract.rs blocks the tokio runtime with thread::sleep():

fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    for _ in 0..5 {
        if r.is_finished() {
            break;
        }
        thread::sleep(Duration::from_secs(1));  // <-- Blocks tokio runtime!
    }
    if !r.is_finished() {
        return Err(Errors::MaxComputeTimeExceeded);
    }
    let (r, s) = r
        .join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}

Impact

  • Event loop blocked: No packets can be processed while waiting for contract execution
  • Channel overflow: 100-packet buffers fill up during the blocking period
  • Connection timeouts: Keep-alive mechanism fails, connections drop after 30 seconds
  • Cascade failures: Dropped packets lead to connection reestablishment attempts

Evidence

Instrumentation on vega gateway shows regular blocking patterns:

[2025-06-16T00:05:28] WARN Long gap between peer connection polls, gap_ms: 30239, queue_size: 2
[2025-06-16T00:06:33] WARN Long gap between peer connection polls, gap_ms: 54937, queue_size: 2
[2025-06-16T00:12:28] WARN Long gap between peer connection polls, gap_ms: 90111, queue_size: 2
[2025-06-16T00:14:33] WARN Long gap between peer connection polls, gap_ms: 95239, queue_size: 2

Why This Happens

  1. Contract execution spawns a separate thread (likely due to wasmer threading constraints)
  2. The calling task blocks the tokio runtime with thread::sleep(Duration::from_secs(1)) while waiting
  3. Even if the contract finishes in 10ms, the loop still blocks for the full 1-second sleep
  4. Multiple contract executions compound: for example, a few dozen queued calls, each forced to sleep for 1-5 seconds, add up to the 30-90 second gaps seen in the logs above

Questions

  1. What are the specific threading constraints that require spawning a separate thread?
  2. Is the wasmer Store not Send/Sync?
  3. Can we use tokio::task::spawn_blocking instead?
  4. Is the 1-second sleep interval necessary or can it be reduced?

Potential Solutions

Quick Fix (minimal risk)

  • Reduce sleep from 1 second to 10-50ms
  • Use exponential backoff instead of fixed interval

Proper Fix (needs understanding of constraints)

  • Use tokio::task::spawn_blocking for contract execution
  • Eliminate the polling loop entirely
  • Use channels or async primitives for completion notification (see the sketch below)
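
A minimal sketch of the completion-notification idea, under assumptions: run_contract is a hypothetical closure standing in for the real contract entry point, and the wasmer Store is moved onto the worker thread and handed back on completion (the existing JoinHandle signature already implies Store is Send). The contract runs on its own OS thread and signals completion over a tokio::sync::oneshot channel, so the event loop awaits instead of sleeping:

use std::time::Duration;
use tokio::sync::oneshot;

// Hedged sketch, not the current freenet-core API: `run_contract` is a hypothetical
// parameter standing in for whatever closure actually drives the contract call.
async fn execute_contract(
    store: wasmer::Store,
    run_contract: impl FnOnce(&mut wasmer::Store) -> Result<i64, wasmer::RuntimeError> + Send + 'static,
) -> anyhow::Result<(i64, wasmer::Store)> {
    let (tx, rx) = oneshot::channel();
    std::thread::spawn(move || {
        let mut store = store;
        let result = run_contract(&mut store);
        // The receiver may already have timed out and dropped; ignore that error.
        let _ = tx.send((result, store));
    });
    // Await completion instead of sleeping, so the event loop keeps processing packets.
    match tokio::time::timeout(Duration::from_secs(5), rx).await {
        Ok(Ok((result, store))) => Ok((result?, store)),
        Ok(Err(_)) => anyhow::bail!("contract thread dropped without replying"),
        Err(_) => anyhow::bail!("max compute time exceeded"),
    }
}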

Related Issues

  • #1658 - Keep-alive timer improvements
  • #1665 - Connection timeout after inactivity

cc @nachodg - I understand you were working on threading issues with the wasmer runtime. Any context on the constraints would be very helpful.

sanity · Jun 16 '25 00:06

Investigation Update

I've found the root cause of why this was implemented this way:

Historical Context

From commit 5c6ffda3 (PR #1401 "fix: don't block on contracts running for too long"):

  • Original problem: Contracts were blocking the thread using thread::scope
  • The fix: Changed to thread::spawn to avoid blocking
  • Unintended consequence: The 1-second polling sleep is actually worse than the original blocking!

Why Threading is Required

The wasmer Store has ownership constraints:

  • Requires exclusive &mut Store access
  • Cannot be shared across threads or held across await points
  • This prevents using normal async patterns like tokio::task::spawn_blocking without careful ownership transfer

The Irony

The fix to prevent blocking made it worse:

  • Before: Blocked for actual contract execution time (could be 10ms)
  • After: Blocks for at least 1 second regardless of execution time
  • Result: Fast contracts that complete in 10ms still block for 1000ms

Proposed Solutions

1. Immediate Fix (Low Risk)

Change the polling interval from 1 second to 10ms with exponential backoff:

fn handle_execution_call(
    r: JoinHandle<(Result<i64, wasmer::RuntimeError>, Store)>,
    rt: &mut super::Runtime,
) -> Result<i64, Errors> {
    let mut wait_ms = 10;
    let mut total_waited = 0;
    
    loop {
        if r.is_finished() {
            break;
        }
        
        if total_waited >= 5000 {  // 5 second timeout
            return Err(Errors::MaxComputeTimeExceeded);
        }
        
        thread::sleep(Duration::from_millis(wait_ms));
        total_waited += wait_ms;
        wait_ms = (wait_ms * 2).min(100); // Exponential backoff, cap at 100ms
    }
    
    let (r, s) = r.join()
        .map_err(|_| Errors::Other(anyhow::anyhow!("Failed to join thread")))?;
    rt.wasm_store = Some(s);
    r.map_err(Errors::Wasmer)
}

This reduces blocking from 1-5 seconds to 10-100ms for typical contracts.

2. Proper Fix with spawn_blocking (Medium Risk)

Transfer Store ownership to a blocking task:

let mut store = self.wasm_store.take().unwrap();
let (result, store) = tokio::task::spawn_blocking(move || {
    let result = validate_func.call(&mut store, param_buf_ptr, state_buf_ptr, related_buf_ptr);
    (result, store)
}).await.map_err(|e| Errors::Other(e.into()))?;
self.wasm_store = Some(store);
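
A note on this approach (my reading, not verified against the current code): spawn_blocking moves the closure onto tokio's dedicated blocking thread pool, so the event loop keeps polling connections while the contract runs, and wrapping the awaited handle in tokio::time::timeout would restore the compute-time limit. One caveat is that if the await is cancelled or times out, the Store that was moved into the closure is not returned, so the timeout path needs a way to recover or rebuild the store.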

3. Robust Long-term Solution (Requires Design)

  • Dedicated WASM executor service with its own thread pool (see the sketch after this list)
  • Message-passing architecture for contract execution
  • Proper cancellation and timeout handling
  • Performance metrics and monitoring
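
A rough sketch of what such an executor service could look like, under stated assumptions: call_contract is a hypothetical stand-in for the real contract entry point, the job payload is a plain byte vector, and each worker thread owns its own wasmer Store. Jobs arrive over a channel and results come back on a tokio::sync::oneshot, so the async side only awaits (with a timeout) and the tokio runtime is never blocked. This is a sketch only, not the executor pool referenced in the later comments:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;
use tokio::sync::oneshot;

// Hypothetical stand-in for the real contract call in
// crates/core/src/wasm_runtime/contract.rs; only here to keep the sketch self-contained.
fn call_contract(store: &mut wasmer::Store, _params: &[u8]) -> Result<i64, wasmer::RuntimeError> {
    let _ = store;
    Ok(0)
}

struct Job {
    params: Vec<u8>,                             // hypothetical call payload
    reply: oneshot::Sender<Result<i64, String>>, // completion notification back to async code
}

fn spawn_executor_pool(workers: usize) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..workers {
        let rx = Arc::clone(&rx);
        thread::spawn(move || {
            // Each worker owns its own Store (default engine, for illustration),
            // so nothing is shared across threads.
            let mut store = wasmer::Store::default();
            loop {
                // The lock is held while waiting for the next job, but released
                // before the contract actually runs.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // all senders dropped: shut this worker down
                };
                let result = call_contract(&mut store, &job.params).map_err(|e| e.to_string());
                let _ = job.reply.send(result); // caller may have timed out; ignore
            }
        });
    }
    tx
}

// Async side: submit a job and await the reply; the tokio runtime is never blocked
// and the timeout gives graceful handling of runaway contracts.
async fn execute(pool: mpsc::Sender<Job>, params: Vec<u8>) -> Result<i64, String> {
    let (reply, rx) = oneshot::channel();
    pool.send(Job { params, reply })
        .map_err(|_| "executor pool shut down".to_string())?;
    tokio::time::timeout(Duration::from_secs(5), rx)
        .await
        .map_err(|_| "max compute time exceeded".to_string())?
        .map_err(|_| "worker dropped the job".to_string())?
}

A bounded job queue (for example, a fixed-capacity tokio::sync::mpsc channel in place of the unbounded std channel) would provide the backpressure mentioned above.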

Next Steps

  1. I'll implement the immediate fix (10ms polling) for testing
  2. This should dramatically reduce the blocking while maintaining safety
  3. We can then design a proper async solution that respects wasmer's constraints

The goal is a robust and efficient solution that:

  • Never blocks the tokio runtime
  • Handles contract timeouts gracefully
  • Maintains high throughput for contract execution
  • Provides proper backpressure when overloaded

sanity · Jun 16 '25 00:06

@sanity isn't this in particular fixed now?

iduartgomez · Jul 14 '25 06:07

This should be fixed for good now with the new executor pool. Need to add better integration tests though.

iduartgomez · Jul 22 '25 12:07