freenet-core
freenet-core copied to clipboard
Race condition in connection manager - handshake completion not properly synchronized
Problem
The test test_gw_to_peer_outbound_conn_forwarded was failing in CI due to a race condition where connection establishment events are not properly synchronized. The test required a 500ms delay to work around timing issues.
Root Cause
- Non-atomic connection management: add_connection() may return before the connection is fully available for forwarding operations
- Asynchronous event handling without proper synchronization: The handshake handler processes events sequentially but doesn't ensure proper ordering/completion
- Event emission timing: Events may be emitted before the underlying operations are fully complete
Current Workaround
Added a 500ms delay in the test between establishing peer connection and joiner connection. This is not acceptable for a production system focused on efficiency.
Expected Behavior
- Connection establishment should be atomic and deterministic
- Once add_connection() returns, the connection should be immediately available for forwarding
- No time-based delays should be required for proper operation
- Connection establishment latency should be minimal and predictable
Proposed Solutions
- Synchronous connection management: Make add_connection() atomic and blocking until connection is fully ready
- Event completion signals: Only emit events when operations are fully complete
- Proper async coordination: Use async synchronization primitives (channels, barriers, etc.) instead of time delays
- Connection state tracking: Implement proper state machine for connection lifecycle
Impact
- Performance: 500ms delays are unacceptable for a distributed system
- Reliability: Race conditions can cause intermittent failures
- Scalability: Poor connection management affects network topology building
- Developer experience: Flaky tests waste CI resources and developer time
Files Involved
- crates/core/src/node/network_bridge/handshake.rs (test with workaround)
- crates/core/src/ring/connection_manager.rs (connection management)
- crates/core/src/operations/connect.rs (forwarding logic)
This issue should be prioritized as it affects core networking functionality and violates Freenet's efficiency design goals.
This race condition was likely caused by the channel overflow issue. Fixed by implementing proper backpressure handling with send() instead of try_send().
I thinkl this would be good to handle better though, probably should keep in the backlog.
Reopening per @iduartgomez's suggestion. While the channel overflow fix may have reduced the likelihood of this race condition, the underlying synchronization issue in handshake completion might still need investigation.
The specific issue was that connection attempts were being rejected with "connection attempt already in progress" errors, suggesting the connection state tracking needs better synchronization.
No longer relevant, I believe, re-open otherwise