[Green CI] Fix test test_storage_s3_queue/test.py::test_max_set_age
- Use unique filenames to prevent subsequent tests from failing.
- Use a busy-loop instead of a sleep, so the test detects as soon as the condition is met.
- Increase max_age from 10s to 20s. This fixes several issues with time-sensitive checks: if the CPU is more overloaded than usual (for instance, because it's executing several heavy parallel tests), the time elapsed between two close test steps may surpass max_age when it is not expected to. That makes the S3Queue perform an extra insertion, invalidating many of the test's assumptions.
- Remove redundant checks.
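The busy-loop replacement mentioned above can be sketched as a small polling helper. This is a minimal illustration, not the exact code in the PR; `wait_for` is a hypothetical name:

```python
import time

def wait_for(condition, timeout=60.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Unlike a fixed sleep, this returns as soon as the condition holds,
    which keeps time-sensitive tests well under limits such as max_age.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

The short polling interval trades a little CPU for much tighter timing than a single long `time.sleep()`.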
Apart from fixing the test itself:
- Update integration test documentation, adding more information about parallel tests
- Use a different instance name when using `pytest-xdist` to allow parallel test execution
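The per-worker instance naming can be sketched like this. pytest-xdist sets the `PYTEST_XDIST_WORKER` environment variable (`gw0`, `gw1`, ...) in each worker process; `instance_name` here is a hypothetical variable for illustration, not the actual identifier used in cluster.py:

```python
import os

# pytest-xdist exposes the worker id ("gw0", "gw1", ...) in this env var;
# outside xdist it is unset, so fall back to "master".
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "master")

# Embedding the worker id keeps instances from parallel workers isolated.
instance_name = f"instance_{worker_id}"
```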
I've run the test in parallel hundreds of times with the following command and found no flakiness:
```shell
tests/integration/runner test_storage_s3_queue/test.py::test_max_set_age -- --count 10000 -n 16
```
Changelog category (leave one):
- CI Fix or Improvement (changelog entry is not required)
Closes #13110
Use unique names for tables and files to prevent parallel tests from affecting each other. This is what actually fixes the test.
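One way to sketch the unique-name generation (an assumption about the approach, not the exact code in the PR; `unique_name` is a hypothetical helper):

```python
import uuid

def unique_name(prefix: str) -> str:
    # A random hex suffix makes collisions between concurrent test runs
    # practically impossible, even across pytest-xdist workers.
    return f"{prefix}_{uuid.uuid4().hex}"
```

For example, `unique_name("s3queue_test")` yields a fresh table or file name on every call.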
AFAIK, tests in a single test.py can be parallelized, but in that case they will run against different ClickHouse servers. And within each such batch of tests that runs on a single ClickHouse server, they are NOT run in parallel. So it does not matter what the tables are named, as long as they are dropped after each test or before the next one, which is already done. Frankly speaking, I think that if this test fails, it is not a bug in the test but a bug in the code.
This is an automated comment for commit 15e0033016eb0e23a7e6f512d5096e50863e3187 with a description of existing statuses. It is updated for the latest CI run.
Successful checks
| Check name | Description | Status |
|---|---|---|
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of the tests fail, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
I left my AWS dev machine running the test all night and got 8300 successes and no failures. I checked out https://github.com/ClickHouse/ClickHouse/pull/67035/commits/5b9427dc566b1e76a82be9f650b9dbc386567fc4 to have some custom traces to help debugging in case something went south.
For reference, this is the script I used. It runs the test in 10 parallel processes, in batches of 100:

```bash
#!/bin/bash
set -uo pipefail

# Kill any leftover containers from previous runs
if [[ $(docker ps | grep -v CREATED) ]]; then
    docker ps | grep -v CREATED | awk '{print $1}' | xargs docker kill
fi

export CLICKHOUSE_TESTS_CLIENT_BIN_PATH=/home/ubuntu/ClickHouse/ClickHouse/build/programs/clickhouse
export CLICKHOUSE_TESTS_SERVER_BIN_PATH=/home/ubuntu/ClickHouse/ClickHouse/build/programs/clickhouse
export CLICKHOUSE_TESTS_BASE_CONFIG_DIR=/home/ubuntu/ClickHouse/ClickHouse/programs/server

# Remove logs from previous runs
find tests/integration -iname "pytest-gw*.log" 2> /dev/null | xargs rm || true

for _ in {1..1000}; do
    echo "Starting batch of 100 tests..."
    tests/integration/runner test_storage_s3_queue/test.py::test_max_set_age -- --count 100 -n 10 2>&1 | tee -a test_max_set_age.log
    echo "Log results..."
    grep -inR "PMO" tests/integration/pytest*.log | tee -a test_max_set_age.log
done
```
And here's the log file: test_max_set_age.zip
@pamarcos, looks like it breaks test_zookeeper_config_load_balancing somehow
I find it strange that the CI check is green; however, the integration tests were still running as far as I can see.
> @pamarcos, looks like it breaks test_zookeeper_config_load_balancing somehow
It's because of the change in cluster.py. Let me create a new PR to remove those lines, and I'll investigate in a separate PR how to make that particular test work with them in place.
I also find it very strange that the CI for this PR didn't catch it.