Introduce testing for identifying regressions which can introduce ttrpc deadlocks
As detailed in https://github.com/containerd/ttrpc/issues/72, in the past, ttrpc has encountered deadlocks on the server side or client side when mismatching versions were used.
| Version Range | Description | Comments |
|---|---|---|
| v1.0.2 and before | Original deadlock bug | #94 for fixing deadlock in v1.1.0 |
| v1.1.0 - v1.2.0 | No known deadlock bugs | |
| v1.2.0 - v1.2.4 | Streaming with a new deadlock bug | #107 introduced deadlock in v1.2.0 |
| After v1.2.4 | No known deadlock bugs | #168 for fixing deadlock in v1.2.4 |
While the current version of ttrpc does not display any deadlocks, we want to introduce a CI regression test which can test the current code against the following older versions in both server and client scenarios:
- v1.0.2
- v1.1.0
- v1.2.0
- v1.2.4
- latest
This issue is filed to discuss how such matrix testing can be introduced for ttrpc.
The objective is to test the latest code against older versions (v1.0.2, v1.1.0, v1.2.0, v1.2.4, current code) by running a matrix of tests to identify potential deadlocks caused by version mismatches. The tests will involve running the latest code as the server with older versions as clients and vice versa.
To test different versions of ttrpc against the latest code, we encounter a challenge with circular dependencies, which are not supported by Go. Using nested Go modules could bypass this limitation, but it introduces a significant drawback: we would only be able to stress test changes when a new tag is created in the package. Consequently, this approach prevents testing changes in each PR. If a potential deadlock is identified in a release, we would need to scan through all the changes to locate the regression, making debugging more time-consuming and less efficient.
Proposed Approach
The proposed approach involves creating a stress tool based on the latest code, backporting it to older releases, and automating tests via a script that builds and orchestrates tests across versions. This script, integrated into GitHub Actions via a make target, ensures early detection of compatibility issues during CI testing.
Steps Involved in the Approach:
Development of the Stress Tool:
- A dedicated `stress` tool will be created, designed to depend on the latest code changes.
- This tool will be integrated into the mainline codebase to ensure it remains updated with ongoing development.
- The `stress` tool can be run as a `server` or a `client`. It represents a simple client-server interaction where the client sends continuous high-volume requests to the server, and the server responds with the same data, allowing for testing of concurrent request handling and response verification.
Backporting the Stress Tool:
Once the stress tool is created, it will be backported into the older releases against which we need to test the latest code.
For now, these would be v1.0.2, v1.1.0, v1.2.0, and v1.2.4. To accomplish this, we would create a branch from each version tag, check in the stress tool, and then cut a new tag.
Script for Automation:
A script will be developed in the mainline codebase to automate the testing process. This script will handle the following tasks:
- Build the `stress` tool using the latest code from the branch the script is executing from.
- Pull and build the corresponding `stress` tool versions from the identified older releases.
- Orchestrate the testing by executing a matrix of tests where the latest tool interacts with older versions and vice versa.
- For each pair of server and client versions, the script will run the `stress` tool for a given number of iterations with the specified number of workers (say `100000` and `100` respectively). If the test does not complete within a specified time period (say `5 minutes`), the script will terminate the test and exit with an appropriate error code, signaling the failure.
Integration with CI/CD Pipeline:
- To ensure continuous validation, the stress tool will be invoked through a make target.
- This make target will be incorporated into the GitHub Actions workflow as part of the Continuous Integration (CI) process. During CI testing, the script will execute the stress tests automatically, providing immediate feedback on the compatibility and robustness of the latest changes against older releases.
@dmcgowan Please do take a look and provide feedback when you can!
@dmcgowan Hi Derek, Do you have any concerns or feedback on the stress testing work proposed by @rawahars ?
The test matrix idea sounds generally good to me. However, I'm not sure if maintaining the stress tool in the ttrpc repository in our release branches is the best idea; that would imply that every time we need to modify stress we then need to backport to N branches and cut N new tags even if there are no functional ttrpc changes.
What about moving stress to another repository and maintaining a single branch there that can use multiple versions of ttrpc? This could look something like:
<stress repo>
├── testsuite
│ └── suite.go
├── ttrpc102
│ ├── main.go
│ ├── go.mod
│ └── go.sum
├── ttrpc110
│ ├── main.go
│ ├── go.mod
│ └── go.sum
├── ttrpc120
│ ├── main.go
│ ├── go.mod
│ └── go.sum
├── ttrpc124
│ ├── main.go
│ ├── go.mod
│ └── go.sum
└── ttrpclatest
    ├── main.go
    ├── go.mod
    └── go.sum
Then the logic can be updated in one place, but we have N directories each with a harness for that specific version of ttrpc under test. We can then make this available via GitHub Actions and a local make target within the ttrpc repo, but it keeps the actual test code separate.
@samuelkarp This idea sounds awesome to me! I can take ownership of the work needed to accomplish this.
Can you please help with the process for creating a new repository within the containerd organization?
Alternatively, I can request a repository in another GitHub organization. It will only be used for testing the ttrpc package anyway.
It would be really great if you or someone from containerd could advise on this!
Maybe you can start developing this in another repository (under your personal account? or another organization?) and once you're ready for it to be integrated we can move it over?
@samuelkarp Thanks for the suggestion! I have started looking into this issue.
> Then the logic can be updated in one place, but we have N directories each with a harness for that specific version of ttrpc under test. We can then make this available via GitHub Actions and a local make target within the ttrpc repo, but it keeps the actual test code separate.
Unfortunately, the earlier suggestion of maintaining N modules, each corresponding to a specific version of ttrpc, does not seem to work as expected. The test module would import all N modules, and since each of them requires a different ttrpc version, Go's minimal version selection will resolve them to a single ttrpc version for the whole build.
We have the following two alternatives, and I would really appreciate your input on them:
- We vendor the ttrpc versions into each of the N modules, so that the uber `test` module does not see any conflict and can import the N modules with their specific `ttrpc` versions.
- We build N binaries, each built with its own `ttrpc` version. Then we create uber orchestration scripts which perform the matrix test. We can possibly automate it using a Go test.
What do you think?
I think option 2 of separate binaries per ttrpc version makes the most sense to me.
@samuelkarp Thanks for your response! I have made the changes in a branch in my own personal repository. https://github.com/rawahars/ttrpc-stress/pull/1
Can you please take a look when you can?