freenet-core Race condition in connection manager - handshake completion not properly synchronized

Problem

The test test_gw_to_peer_outbound_conn_forwarded was failing in CI due to a race condition where connection establishment events are not properly synchronized. The test required a 500ms delay to work around timing issues.

Root Cause

Non-atomic connection management: add_connection() may return before the connection is fully available for forwarding operations
Asynchronous event handling without proper synchronization: The handshake handler processes events sequentially but doesn't ensure proper ordering/completion
Event emission timing: Events may be emitted before the underlying operations are fully complete

Current Workaround

Added a 500ms delay in the test between establishing peer connection and joiner connection. This is not acceptable for a production system focused on efficiency.

Expected Behavior

Connection establishment should be atomic and deterministic
Once add_connection() returns, the connection should be immediately available for forwarding
No time-based delays should be required for proper operation
Connection establishment latency should be minimal and predictable

Proposed Solutions

Synchronous connection management: Make add_connection() atomic and blocking until connection is fully ready
Event completion signals: Only emit events when operations are fully complete
Proper async coordination: Use async synchronization primitives (channels, barriers, etc.) instead of time delays
Connection state tracking: Implement proper state machine for connection lifecycle

Impact

Performance: 500ms delays are unacceptable for a distributed system
Reliability: Race conditions can cause intermittent failures
Scalability: Poor connection management affects network topology building
Developer experience: Flaky tests waste CI resources and developer time

Files Involved

crates/core/src/node/network_bridge/handshake.rs (test with workaround)
crates/core/src/ring/connection_manager.rs (connection management)
crates/core/src/operations/connect.rs (forwarding logic)

This issue should be prioritized as it affects core networking functionality and violates Freenet's efficiency design goals.

Jun 12 '25 03:06 sanity

This race condition was likely caused by the channel overflow issue. Fixed by implementing proper backpressure handling with send() instead of try_send().

Jun 14 '25 17:06 sanity

I thinkl this would be good to handle better though, probably should keep in the backlog.

Jun 14 '25 17:06 iduartgomez

Reopening per @iduartgomez's suggestion. While the channel overflow fix may have reduced the likelihood of this race condition, the underlying synchronization issue in handshake completion might still need investigation.

The specific issue was that connection attempts were being rejected with "connection attempt already in progress" errors, suggesting the connection state tracking needs better synchronization.

Jun 14 '25 18:06 sanity

No longer relevant, I believe, re-open otherwise

Aug 03 '25 06:08 iduartgomez

freenet-core freenet-core copied to clipboard

Race condition in connection manager - handshake completion not properly synchronized

Problem

Root Cause

Current Workaround

Expected Behavior

Proposed Solutions

Impact

Files Involved

freenet-core
freenet-core copied to clipboard