issue: 4724535 Fix TSO slow start with aggressive cwnd/ssthresh
Description
Resolve 30-second throughput ramp-up issue for TSO-enabled TCP connections by implementing proper initial congestion window (cwnd) and slow start threshold (ssthresh) values.
Problem: Applications using TSO experienced ~30 seconds of near-zero throughput before achieving line-rate. Debug analysis revealed that ssthresh was being unconditionally reset to 10*MSS (14,600 bytes) during SYN-ACK processing in tcp_in.c line 582, forcing TCP into congestion avoidance mode immediately. This caused linear cwnd growth instead of exponential slow start, resulting in extremely slow ramp-up.
Solution: Created centralized helper function tcp_set_initial_cwnd_ssthresh() that sets TSO-aware parameters:
For TSO-enabled connections:
- cwnd = TSO_max_payload / 4 (64KB with default 256KB TSO)
- ssthresh = 0x7FFFFFFF (2GB - effectively unlimited)
For non-TSO connections:
- cwnd = RFC 3390 compliant: min(4MSS, max(2MSS, 4380 bytes))
- ssthresh = 10 * MSS
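The rules above can be sketched as a small, self-contained C function. The struct and field names below are simplified stand-ins for the real lwip PCB, not the actual XLIO definitions; only the helper's name and the numeric policy come from this PR:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified PCB: field names mirror the PR's description,
 * not the real XLIO structures. */
struct tcp_pcb_sketch {
    uint32_t mss;
    uint32_t tso_max_payload_sz; /* 0 when TSO is not configured */
    uint32_t cwnd;
    uint32_t ssthresh;
};

#define TCP_SSTHRESH_UNLIMITED 0x7FFFFFFFU /* ~2GB: effectively unlimited */

/* Sketch of the centralized helper described above. */
static void tcp_set_initial_cwnd_ssthresh(struct tcp_pcb_sketch *pcb)
{
    if (pcb->tso_max_payload_sz > 0) {
        /* TSO path: MSS-independent, driven by hardware capabilities */
        pcb->cwnd = pcb->tso_max_payload_sz / 4; /* 64KB with 256KB TSO */
        pcb->ssthresh = TCP_SSTHRESH_UNLIMITED;  /* let slow start run */
    } else {
        /* RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes)) */
        uint32_t iw = 2 * pcb->mss;
        if (iw < 4380) {
            iw = 4380;
        }
        if (iw > 4 * pcb->mss) {
            iw = 4 * pcb->mss;
        }
        pcb->cwnd = iw;
        pcb->ssthresh = 10 * pcb->mss;
    }
}
```

Note the TSO branch never divides by zero: when `tso_max_payload_sz` is 0 (TSO not yet configured), the RFC 3390 path is taken instead.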
Technical Rationale:
- A very high ssthresh (2GB) follows industry best practices, allowing slow start to run until network conditions dictate otherwise rather than artificially limiting growth (Excentis research on optimizing TCP for gigabit networks).
- The TSO max payload is independent of the negotiated MSS (it is determined by hardware capabilities), so the initial window should also be independent of MSS for TSO connections.
- An initial cwnd of 64KB (TSO_max/4) balances aggressive throughput with conservative buffer management. This exceeds RFC 6928's recommendation of 10 segments (~15KB) but is appropriate for XLIO's controlled environment, where TSO hardware handles segmentation and applications target high-throughput scenarios. Empirically verified to achieve 200 Gbps in <1 second.
Implementation Details:
- Replaced duplicate TSO initialization logic in 6 locations:
- tcp_pcb_init() - initial PCB setup
- tcp_pcb_recycle() - PCB reuse after TIME_WAIT
- tcp_connect() - client-side connection initiation
- tcp_in.c SYN-ACK handler - CRITICAL FIX (line 584)
- lwip_conn_init() - LWIP CC module initialization
- cubic_conn_init() - Cubic CC module initialization
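The SYN-ACK handler was the critical site: its unconditional `ssthresh = 10 * MSS` reset clobbered the TSO-aware value set earlier. A simplified before/after sketch (the struct and function names here are illustrative stand-ins, not the real lwip code):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the lwip PCB. */
struct pcb_s {
    uint32_t mss, cwnd, ssthresh, tso_max;
};

/* Old behavior (simplified): unconditional reset in the SYN-ACK handler,
 * which forced congestion avoidance immediately for TSO connections. */
static void synack_old(struct pcb_s *p)
{
    p->ssthresh = 10 * p->mss;
}

/* New behavior (simplified): defer to the TSO-aware policy so the
 * aggressive values survive connection establishment. */
static void synack_new(struct pcb_s *p)
{
    if (p->tso_max > 0) {
        p->cwnd = p->tso_max / 4;  /* 64KB with 256KB TSO */
        p->ssthresh = 0x7FFFFFFFU; /* effectively unlimited */
    } else {
        p->ssthresh = 10 * p->mss; /* non-TSO path unchanged */
    }
}
```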
Performance Impact:
- Before: 20+ seconds to reach line-rate (200 Gbps)
- After: line-rate achieved in <1 second
Verification: GDB debugging confirmed ssthresh was being overwritten during SYN-ACK processing. After fix, cwnd=64KB and ssthresh=2GB are maintained throughout connection establishment, enabling exponential growth as designed.
References:
- RFC 3390: Increasing TCP's Initial Window
- RFC 5681: TCP Congestion Control
- RFC 6928: Increasing TCP's Initial Window (10 segments standard)
- Excentis: "Optimizing TCP Congestion Avoidance Parameters for Gigabit Networks" - recommends very high ssthresh (approaching 2^31) for fast networks
- NASA: "Performance Analysis of TCP with Large Segmentation Offload" - analysis of TSO impact on congestion control
What
Fix TSO slow start with aggressive cwnd/ssthresh
Why?
xlio_benchmark to alpha.
How?
Change type
What kind of change does this PR introduce?
- [ ] Bugfix
- [ ] Feature
- [ ] Code style update
- [ ] Refactoring (no functional changes, no api changes)
- [ ] Build related changes
- [ ] CI related changes
- [ ] Documentation content changes
- [ ] Tests
- [ ] Other
Check list
- [ ] Code follows the style de facto guidelines of this project
- [ ] Comments have been inserted in hard to understand places
- [ ] Documentation has been updated (if necessary)
- [ ] Test has been added (if possible)
Greptile Overview
Greptile Summary
This PR fixes a critical TCP slow start performance issue for TSO-enabled connections that caused 30+ second ramp-up times. The root cause was aggressive initial window values being overwritten during SYN-ACK processing in tcp_in.c:582.
Key Changes:
- Created centralized helper functions `tcp_set_initial_cwnd_ssthresh()` and `tcp_reset_cwnd_on_congestion()` in `tcp.c` to manage TSO-aware congestion control parameters
- Fixed both the CUBIC and LWIP congestion control algorithms to use proper exponential slow start (`cwnd += acked` instead of `cwnd += mss`) with overflow protection
- Added a new configuration option `tcp_cc_tso_aware` (default: true) to enable/disable TSO-aware optimizations
- TSO-aware mode uses an aggressive initial window (cwnd=64KB, ssthresh=2GB) that is independent of the negotiated MSS
- Replaced 6 duplicate initialization sites with centralized helper function calls
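The slow-start change can be illustrated with a minimal sketch: grow cwnd by the number of newly acked bytes (roughly doubling per RTT), guarded by the same style of unsigned-overflow check the original LWIP code used. This is a simplified stand-alone function, not the actual CC module code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32_t;

/* Exponential slow start step: cwnd += acked, with a wraparound guard.
 * If cwnd + acked overflows u32, the sum wraps and is not greater than
 * cwnd, so the window is left unchanged. */
static u32_t slow_start_grow(u32_t cwnd, u32_t acked)
{
    if ((u32_t)(cwnd + acked) > cwnd) {
        cwnd += acked;
    }
    return cwnd;
}
```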
Previous Thread Resolution:
The developer has addressed the division-by-zero concerns raised in previous comments. The helper functions use `if (tcp_tso(pcb) && ...)` checks before any division operations. When `max_payload_sz` is 0 (during `tcp_pcb_init()`), the condition evaluates to false and the RFC 3390 path is taken. TSO is configured later in `sockinfo_tcp.cpp:1149`, after which the helper is called again with proper values.
The overflow protection (`if ((u32_t)(pcb->cwnd + pcb->acked) > pcb->cwnd)`) has been correctly added to both the CUBIC and LWIP implementations, matching the pattern used in the original LWIP code.
Confidence Score: 4/5
- This PR is safe to merge, with minor concerns about LWIP_ASSERT usage that are mitigated by the empty macro definition
- The implementation correctly addresses the performance issue with proper safeguards. Score is 4/5 due to: (1) LWIP_ASSERT statements that check `max_payload_sz > 0` after the conditional checks - while safe because LWIP_ASSERT is empty, this creates potential confusion; (2) the fix deviates significantly from RFC 5681 and modern CUBIC with its aggressive TSO-aware parameters, though this is intentional and well-documented; (3) the fix changes fundamental TCP behavior, which requires thorough testing in production environments
- Pay close attention to `src/core/lwip/tcp.c` (lines 527-641), where the core logic resides, and verify TSO configuration timing in `src/core/sock/sockinfo_tcp.cpp:1149`
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/core/lwip/tcp.c | 4/5 | Added centralized helper functions tcp_set_initial_cwnd_ssthresh() and tcp_reset_cwnd_on_congestion() to implement TSO-aware congestion control with proper safeguards against division-by-zero through conditional checks |
| src/core/lwip/cc_cubic.c | 5/5 | Fixed slow start to use proper exponential growth (cwnd += acked) with overflow protection, and replaced duplicate congestion recovery logic with centralized helper function |
| src/core/lwip/cc_lwip.c | 5/5 | Fixed slow start to use proper exponential growth (cwnd += acked) with overflow protection, replaced duplicate congestion signal handling with centralized function |
| src/core/lwip/tcp_in.c | 5/5 | Replaced hardcoded ssthresh initialization in SYN-ACK handler with call to centralized helper function, ensuring TSO-aware values are maintained |
| src/core/util/sys_vars.cpp | 5/5 | Added C-compatible accessor function get_tcp_cc_tso_aware() and configuration handling for new tcp_cc_tso_aware option with default value true |
| src/core/util/sys_vars.h | 5/5 | Added tcp_cc_tso_aware boolean field and environment variable definition XLIO_TCP_CC_TSO_AWARE with C-compatible accessor declaration |
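A hedged sketch of the configuration plumbing named in the table: an env-driven flag with a C-compatible accessor. The real parsing in `sys_vars.cpp` is more involved; only `get_tcp_cc_tso_aware()`, `tcp_cc_tso_aware`, and `XLIO_TCP_CC_TSO_AWARE` come from this PR, the rest is illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the sys_vars option storage. */
static bool s_tcp_cc_tso_aware = true; /* default: true */

/* Read XLIO_TCP_CC_TSO_AWARE from the environment; "0" disables the
 * TSO-aware optimizations, anything else (or unset) keeps the default. */
static void load_tcp_cc_tso_aware(void)
{
    const char *v = getenv("XLIO_TCP_CC_TSO_AWARE");
    if (v != NULL) {
        s_tcp_cc_tso_aware = (strcmp(v, "0") != 0);
    }
}

/* C-compatible accessor named in the file analysis above. */
int get_tcp_cc_tso_aware(void)
{
    return s_tcp_cc_tso_aware ? 1 : 0;
}
```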
Sequence Diagram
```mermaid
sequenceDiagram
participant App as Application
participant SI as sockinfo_tcp
participant TCP as tcp.c
participant TI as tcp_in.c
participant CC as CC Module (cubic/lwip)
participant SV as sys_vars
Note over App,SV: Connection Initialization Flow
App->>SI: socket() + bind()
SI->>TCP: tcp_pcb_init(pcb)
TCP->>TCP: memset(pcb, 0) → tso.max_payload_sz = 0
TCP->>SV: get_tcp_cc_tso_aware()
SV-->>TCP: returns config value
TCP->>TCP: tcp_set_initial_cwnd_ssthresh(pcb)
Note over TCP: tcp_tso(pcb) = 0 (TSO not configured yet)<br/>Falls through to RFC 3390 path:<br/>cwnd = 4*MSS, ssthresh = 10*MSS
TCP->>CC: cc_init(pcb)
Note over CC: If cwnd==1, calls tcp_set_initial_cwnd_ssthresh again<br/>(no-op since cwnd already set)
App->>SI: connect()
SI->>TCP: tcp_connect(pcb)
SI->>SI: Configure TSO: pcb.tso.max_payload_sz = 256KB
TCP->>TCP: tcp_set_initial_cwnd_ssthresh(pcb)
Note over TCP: NOW tcp_tso(pcb) = 256KB<br/>TSO-aware path:<br/>cwnd = 64KB, ssthresh = 2GB
TCP->>TI: Send SYN
Note over TI: Remote Host Responds
TI->>TI: Receive SYN-ACK
TI->>TI: MSS negotiation (e.g., MSS=1460)
TI->>TCP: tcp_set_initial_cwnd_ssthresh(pcb)
Note over TCP: Re-initialize after MSS negotiation<br/>TSO still enabled:<br/>cwnd = 64KB (unchanged, MSS-independent)<br/>ssthresh = 2GB
TI->>CC: cc_conn_init(pcb)
Note over CC: If cwnd==1, would reinitialize<br/>But cwnd=64KB, so preserves values
Note over App,SV: Data Transfer (Slow Start)
App->>TCP: tcp_write() + tcp_output()
TCP->>TI: Send segments
TI->>TI: Receive ACKs
TI->>CC: cc_ack_received(pcb, CC_ACK)
Note over CC: cwnd <= ssthresh (64KB < 2GB)<br/>Slow start: cwnd += acked<br/>Exponential growth with overflow check
Note over App,SV: RTO/Congestion Event
TI->>CC: cc_cong_signal(pcb, CC_RTO)
CC->>TCP: tcp_reset_cwnd_on_congestion(pcb, is_rto=true)
Note over TCP: TSO-aware recovery:<br/>cwnd = 26KB (10% of TSO max)<br/>ssthresh = max(cwnd/2, 64KB)
```
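Based on the recovery note in the diagram above, the congestion helper might look roughly like this sketch; the struct, field names, and the non-RTO branch are assumptions, while the RTO policy (cwnd to ~10% of TSO max, ssthresh at least 64KB) comes from the diagram:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified PCB for the recovery sketch. */
struct tcp_pcb_rec {
    uint32_t tso_max_payload_sz;
    uint32_t cwnd;
    uint32_t ssthresh;
};

/* Sketch of tcp_reset_cwnd_on_congestion(): on RTO, fall back to ~10%
 * of the TSO max payload; keep ssthresh at max(cwnd/2, 64KB). */
static void tcp_reset_cwnd_on_congestion(struct tcp_pcb_rec *pcb, int is_rto)
{
    uint32_t half = pcb->cwnd / 2;
    pcb->ssthresh = (half > 65536U) ? half : 65536U;
    if (is_rto) {
        pcb->cwnd = pcb->tso_max_payload_sz / 10; /* ~26KB with 256KB TSO */
    } else {
        /* Assumed non-RTO behavior: restart from ssthresh. */
        pcb->cwnd = pcb->ssthresh;
    }
}
```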
@pasis - could you please review this?
bot:retest