ttrpc Introduce testing for identifying regressions which can introduce ttrpc deadlocks

As detailed in https://github.com/containerd/ttrpc/issues/72, ttrpc has in the past encountered deadlocks on the server side or the client side when mismatched versions were used.

| Version Range | Description | Comments |
| --- | --- | --- |
| v1.0.2 and before | Original deadlock bug | #94 for fixing deadlock in `v1.1.0` |
| v1.1.0 - v1.2.0 | No known deadlock bugs | |
| v1.2.0 - v1.2.4 | Streaming with a new deadlock bug | #107 introduced deadlock in `v1.2.0` |
| After v1.2.4 | No known deadlock bugs | #168 for fixing deadlock in `v1.2.4` |

While the current version of ttrpc has no known deadlocks, we want to introduce a CI regression test that runs the current code against the following older versions, in both the server and the client role:

v1.0.2
v1.1.0
v1.2.0
v1.2.4
latest

This issue is filed to discuss how such matrix testing can be introduced for ttrpc.

Jan 18 '25 13:01 rawahars

The objective is to test the latest code against older versions (v1.0.2, v1.1.0, v1.2.0, v1.2.4, current code) by running a matrix of tests to identify potential deadlocks caused by version mismatches. The tests will involve running the latest code as the server with older versions as clients and vice versa.

To test different versions of ttrpc against the latest code, we encounter a challenge with circular dependencies, which are not supported by Go. Using nested Go modules could bypass this limitation, but it introduces a significant drawback: we would only be able to stress test changes when a new tag is created in the package. Consequently, this approach prevents testing changes in each PR. If a potential deadlock is identified in a release, we would need to scan through all the changes to locate the regression, making debugging more time-consuming and less efficient.

Proposed Approach

The proposed approach involves creating a stress tool based on the latest code, backporting it to older releases, and automating tests via a script that builds and orchestrates tests across versions. This script, integrated into GitHub Actions via a make target, ensures early detection of compatibility issues during CI testing.

Steps Involved in the Approach:

Development of the Stress Tool:

A dedicated stress tool will be created, designed to depend on the latest code changes.
This tool will be integrated into the mainline codebase to ensure it remains updated with ongoing development.
The stress tool can be run as a server or as a client. It represents a simple client-server interaction where the client sends continuous high-volume requests to the server, and the server responds with the same data, allowing for testing of concurrent request handling and response verification.

Backporting the Stress Tool:

Once the stress tool is created, it will be backported into the older releases against which we need to test the latest code. For now these would be v1.0.2, v1.1.0, v1.2.0, and v1.2.4. To accomplish this, we would create a branch from each tagged version, check in the stress tool, and then cut a new tag.

Script for Automation:

A script will be developed in the mainline codebase to automate the testing process. This script will handle the following tasks:

Build the stress tool using the latest code from the branch the script is executing from.
Pull and build the corresponding stress tool versions from the identified older releases.
Orchestrate the testing by executing a matrix of tests where the latest tool interacts with older versions and vice versa.
For each pair of server and client versions, the script will run the stress tool for a given number of iterations with a specified number of workers (say 100,000 and 100 respectively). If the test does not complete within a specified time period (say 5 minutes), the script will terminate the test and exit with an appropriate error code, signaling the failure.

Integration with CI/CD Pipeline:

To ensure continuous validation, the stress tool will be invoked through a make target.
This make target will be incorporated into the GitHub Actions workflow as part of the Continuous Integration (CI) process. During CI testing, the script will execute the stress tests automatically, providing immediate feedback on the compatibility and robustness of the latest changes against older releases.

Jan 20 '25 09:01 rawahars

@dmcgowan Please do take a look and provide feedback when you can!

Jan 20 '25 09:01 rawahars

@dmcgowan Hi Derek, do you have any concerns or feedback on the stress testing work proposed by @rawahars?

Feb 10 '25 20:02 kiashok

The test matrix idea sounds generally good to me. However, I'm not sure if maintaining the stress tool in the ttrpc repository in our release branches is the best idea; that would imply that every time we need to modify stress we then need to backport to N branches and cut N new tags even if there are no functional ttrpc changes.

What about moving stress to another repository and maintaining a single branch there that can use multiple versions of ttrpc? This could look something like:

<stress repo>
├── testsuite
│   └── suite.go
├── ttrpc102
│   ├── main.go
│   ├── go.mod
│   └── go.sum
├── ttrpc110
│   ├── main.go
│   ├── go.mod
│   └── go.sum
├── ttrpc120
│   ├── main.go
│   ├── go.mod
│   └── go.sum
├── ttrpc124
│   ├── main.go
│   ├── go.mod
│   └── go.sum
└── ttrpclatest
    ├── main.go
    ├── go.mod
    └── go.sum

Then the logic can be updated in one place, but we have N directories each with a harness for that specific version of ttrpc under test. We can then make this available via GitHub Actions and a local make target within the ttrpc repo, but it keeps the actual test code separate.

Feb 10 '25 21:02 samuelkarp

@samuelkarp This idea sounds awesome to me! I can take ownership of the work needed to accomplish this.

Can you please help with the process for creating a new repository within the containerd organization? Alternatively, I can request a repository in another GitHub organization. Either way, it will be used only for testing the ttrpc package.

It would be really great if you or someone from containerd could help or advise on this!

Feb 13 '25 09:02 rawahars

Maybe you can start developing this in another repository (under your personal account? or another organization?) and once you're ready for it to be integrated we can move it over?

Feb 18 '25 19:02 samuelkarp

@samuelkarp Thanks for the suggestion! I have started looking into this issue.

> Then the logic can be updated in one place, but we have N directories each with a harness for that specific version of ttrpc under test. We can then make this available via GitHub Actions and a local make target within the ttrpc repo, but it keeps the actual test code separate.

Unfortunately, the earlier suggestion of maintaining N modules, each pinned to a specific version of ttrpc, does not work as expected. The test module would import all N modules, and since each of them requires a different ttrpc version, the Go build system resolves them to a single ttrpc version (a build can contain only one version of a given module path).

We have the following two alternatives, and I would really appreciate your input on them:

1. We vendor the ttrpc version into each of the N modules, so that the top-level test module does not see any conflict and can import the N modules with their specific ttrpc versions.
2. We build N binaries, each built with its own ttrpc version. Then we create top-level orchestration scripts which perform the matrix test. We can possibly automate it using a Go test.

What do you think?

Feb 28 '25 13:02 rawahars

I think option 2 of separate binaries per ttrpc version makes the most sense to me.

Mar 12 '25 17:03 samuelkarp

@samuelkarp Thanks for your response! I have made the changes in a branch in my personal repository: https://github.com/rawahars/ttrpc-stress/pull/1

Can you please take a look when you can?

Mar 17 '25 22:03 rawahars