libxlio issue: 4409403 Fix heap corruption since c73d96a

Description

This PR fixes a critical heap corruption issue in TCP socket timer management that was introduced in commit c73d96a3e7a9b64c77ed024fa94cf00b7fdca339.

What

Fix race condition between timer thread and socket deletion in TCP socket timers.

Why ?

A race condition was introduced in commit c73d96a that allowed the timer thread to access socket objects after they had been deleted by the event handler thread, causing heap corruption and crashes.

How ?

The fix improves synchronization between threads:

Removes sockets from timer collections while holding the socket lock
Creates a new deletion path that doesn't attempt to access timer collections after socket cleanup
Additionally fixes a lock leak in the early return path of clean_socket_obj()

Change type

What kind of change does this PR introduce?

[X] Bugfix
[ ] Feature
[ ] Code style update
[ ] Refactoring (no functional changes, no api changes)
[ ] Build related changes
[ ] CI related changes
[ ] Documentation content changes
[ ] Tests
[ ] Other

Check list

[ ] Code follows the style de facto guidelines of this project
[ ] Comments have been inserted in hard to understand places
[ ] Documentation has been updated (if necessary)
[ ] Test has been added (if possible)

Performance impact

We now hold the socket lock a bit longer during cleanup to remove it from timer collections This is a one-time cost during socket closure, not during normal operation Only affects the cleanup path, not data transmission or reception.

It can affect CPS tests, but preventing heap corruption far outweighs that minor performance impact imho.

How this bug was found?

sockperf UDP ping-pong mode recreated this issue - it was found during new config STP.
git bisect found the commit
Cursor helped with explaining the commit issues
bug found
sockperf didn't crash anymore :)

attaching a pic for a sequence diagram explaining the faulty flow that was fixed:

Apr 22 '25 12:04 tomerdbz

As we discussed offline, The "timer thread" shouldn't exist and running timers outside the internal thread (event handler thread) is problematic indeed. In the past, we removed such a flow when the tcp timer could run as part of sockinfo_tcp::tcp_con_unlock() causing issues.

First, we need to identify the flow to run tcp timer outside of the internal thread context and this flow will like need to be avoided.

Note, this comment assumes that delegated timers feature is off.

Apr 23 '25 10:04 pasis

The bug is real - but with the understanding I have now of the repo - I realize this is not the root cause - will solve this right

Oct 20 '25 17:10 tomerdbz