issue: 4409403 Fix heap corruption since c73d96a
Description
This PR fixes a critical heap corruption issue in TCP socket timer management that was introduced in commit c73d96a3e7a9b64c77ed024fa94cf00b7fdca339.
What
Fix race condition between timer thread and socket deletion in TCP socket timers.
Why ?
A race condition was introduced in commit c73d96a that allowed the timer thread to access socket objects after they had been deleted by the event handler thread, causing heap corruption and crashes.
How ?
The fix improves synchronization between threads:
- Removes sockets from timer collections while holding the socket lock
- Creates a new deletion path that doesn't attempt to access timer collections after socket cleanup
- Additionally fixes a lock leak in the early return path of clean_socket_obj()
Change type
What kind of change does this PR introduce?
- [X] Bugfix
- [ ] Feature
- [ ] Code style update
- [ ] Refactoring (no functional changes, no api changes)
- [ ] Build related changes
- [ ] CI related changes
- [ ] Documentation content changes
- [ ] Tests
- [ ] Other
Check list
- [ ] Code follows the style de facto guidelines of this project
- [ ] Comments have been inserted in hard to understand places
- [ ] Documentation has been updated (if necessary)
- [ ] Test has been added (if possible)
Performance impact
We now hold the socket lock a bit longer during cleanup to remove it from timer collections This is a one-time cost during socket closure, not during normal operation Only affects the cleanup path, not data transmission or reception.
It can affect CPS tests, but preventing heap corruption far outweighs that minor performance impact imho.
How this bug was found?
- sockperf UDP ping-pong mode recreated this issue - it was found during new config STP.
- git bisect found the commit
- Cursor helped with explaining the commit issues
- bug found
- sockperf didn't crash anymore :)
attaching a pic for a sequence diagram explaining the faulty flow that was fixed:
As we discussed offline, The "timer thread" shouldn't exist and running timers outside the internal thread (event handler thread) is problematic indeed. In the past, we removed such a flow when the tcp timer could run as part of sockinfo_tcp::tcp_con_unlock() causing issues.
First, we need to identify the flow to run tcp timer outside of the internal thread context and this flow will like need to be avoided.
Note, this comment assumes that delegated timers feature is off.
The bug is real - but with the understanding I have now of the repo - I realize this is not the root cause - will solve this right