ClickHouse icon indicating copy to clipboard operation
ClickHouse copied to clipboard

[Green CI] Fix test test_storage_s3_queue/test.py::test_max_set_age

Open pamarcos opened this issue 1 year ago • 3 comments

Fix test_storage_s3_queue/test.py::test_max_set_age

  • Use unique filenames to prevent following tests failing.
  • Use a busy-loop instead of a sleep to find the condition is met ASAP.
  • Increase from 10s to 20s the max_age. Fixes several issues for time-sensitive checks. For instance, in case the CPU is more overloaded than usual (for instance, because it's executing several heavy parallel tests), the time elapsed between some close test steps might get even surpass max_age when it's not expected to do so. This makes the S3Queue to do an extra insertion, invalidating lot of the assumptions of the test.
  • Remove redundant checks.

Apart from fixing the test itself:

  • Update integration test documentation, adding more information about parallel tests
  • Use a different instance name when using pytest-xdist to allow parallel test execution

I've run hundreds of times the test in parallel with the following command with no flakiness found:

tests/integration/runner test_storage_s3_queue/test.py::test_max_set_age -- --count 10000 -n 16

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Closes #13110

pamarcos avatar Jul 24 '24 12:07 pamarcos

Use unique names for tables and files to prevent parallel tests affecting. This is what actually fixes the test

AFAIK tests in a single test.py can be parallelized, but in this case they will be run with different clickhouse servers. And within each such batch of tests which is run on a single clickhouse server, they are NOT run in parallel. So it does not matter what the names of the tables are as long as they are dropped after the test or before each further test, which is already done. Frankly speaking I think if this test fails, then it is not a bug in the test, but a bug in the code.

kssenii avatar Jul 24 '24 12:07 kssenii

This is an automated comment for commit 15e0033016eb0e23a7e6f512d5096e50863e3187 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

robot-clickhouse-ci-1 avatar Jul 24 '24 13:07 robot-clickhouse-ci-1

I've left my AWS dev machine all night running the test and I got 8300 successes and no failures. I checked out at https://github.com/ClickHouse/ClickHouse/pull/67035/commits/5b9427dc566b1e76a82be9f650b9dbc386567fc4 to have some custom traces to help debugging in case something went south.

This is the script that I used, for reference. Running the test in 10 parallel processes in batches of 100:

!/bin/bash

set -uo pipefail

if [[ $(docker ps | grep -v CREATED) ]]; then
    docker ps | grep -v CREATED | awk '{print $1}' | xargs docker kill
fi

export CLICKHOUSE_TESTS_CLIENT_BIN_PATH=/home/ubuntu/ClickHouse/ClickHouse/build/programs/clickhouse
export CLICKHOUSE_TESTS_SERVER_BIN_PATH=/home/ubuntu/ClickHouse/ClickHouse/build/programs/clickhouse
export CLICKHOUSE_TESTS_BASE_CONFIG_DIR=/home/ubuntu/ClickHouse/ClickHouse/programs/server

find tests/integration -iname "pytest-gw*.log" 2> /dev/null | xargs rm || true

for _ in {1..1000}; do
    echo "Starting batch of 1000 tests..."
    tests/integration/runner test_storage_s3_queue/test.py::test_max_set_age -- --count 100 -n 10 2>&1 | tee -a test_max_set_age.log
    echo "Log results..."
    grep -inR "PMO" tests/integration/pytest*.log | tee -a test_max_set_age.log
done

And here's the log file: test_max_set_age.zip

pamarcos avatar Jul 30 '24 08:07 pamarcos

@pamarcos, looks like it breaks test_zookeeper_config_load_balancing somehow

tavplubix avatar Jul 31 '24 22:07 tavplubix

I find it strange that the CI Check is green, however the integration tests were still running as far as I see.

antaljanosbenjamin avatar Jul 31 '24 22:07 antaljanosbenjamin

@pamarcos, looks like it breaks test_zookeeper_config_load_balancing somehow

It's because of the change in cluster.py. Let me create a new PR to remove those lines and I'll investigate in a separate PR how to make that particular test work while having them.

I also find it very strange that the CI for this PR didn't catch it.

pamarcos avatar Aug 01 '24 09:08 pamarcos