Filebeat (Python) Unit Tests are flaky on Buildkite macOS x86_64 agents especially on `8.13` branch
Flaky Test
-
Test Name:
tests/system/test_shutdown.pybut alsotests/system/test_registrar.py - Link: https://github.com/elastic/beats/blob/bf138189db06df45c9ef3f4a8313cafbb6dd5583/filebeat/tests/system/test_shutdown.py#L17-L31
-
Branch:
8.13but also other branches like8.14 - Artifact Link: https://buildkite.com/organizations/elastic/pipelines/filebeat/builds/6343/jobs/018f858e-f9e1-46d0-ac2a-806e13454195/artifacts/018f85a5-23d2-4008-88c8-764765abaab5
- Notes: Additional details about the test. e.g. theory as to failure cause
Stack Trace
tests/system/test_registrar.py
Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f825c-39a5-4588-8257-9bac31116e69/1804-1935
=================================== FAILURES ===================================
________________________ Test.test_restart_state_reset _________________________
self = <test_registrar.Test testMethod=test_restart_state_reset>
def test_restart_state_reset(self):
"""
Test that ttl is set to -1 after restart and no inputs covering it
"""
self.render_config_template(
path=os.path.abspath(self.working_dir) + "/log/*",
clean_inactive="10s",
ignore_older="5s"
)
os.mkdir(self.working_dir + "/log/")
testfile_path = self.working_dir + "/log/test.log"
with open(testfile_path, 'w') as testfile:
testfile.write("Hello World\n")
filebeat = self.start_beat()
# Wait until state written
> self.wait_until(
lambda: self.output_has(lines=1),
max_timeout=30)
tests/system/test_registrar.py:991:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_registrar.Test testMethod=test_restart_state_reset>
cond = <function Test.test_restart_state_reset.<locals>.<lambda> at 0x10eb79870>
max_timeout = 30, poll_interval = 0.1, name = 'cond', err_msg = ''
def wait_until(self, cond, max_timeout=20, poll_interval=0.1, name="cond", err_msg=""):
"""
TODO: this can probably be a "wait_until_output_count", among other things, since that could actually use `self`, and this can become an internal function
Waits until the cond function returns true,
or until the max_timeout is reached. Calls the cond
function every poll_interval seconds.
If the max_timeout is reached before cond() returns
true, an exception is raised.
"""
start = datetime.now()
while not cond():
if datetime.now() - start > timedelta(seconds=max_timeout):
print("Test has failed, here are the Beat logs")
for l in self.get_log_lines():
print(l)
> raise WaitTimeoutError(
f"Timeout waiting for condition '{name}'. Waited {max_timeout} seconds: {err_msg}")
E beat.beat.WaitTimeoutError: Timeout waiting for condition 'cond'. Waited 30 seconds:
../libbeat/tests/system/beat/beat.py:449: WaitTimeoutError
tests/system/test_shutdown.py
Buildkite failure link: https://buildkite.com/elastic/filebeat/builds/6343#018f858e-f9e1-46d0-ac2a-806e13454195/1889-1939
______________________________ Test.test_shutdown ______________________________
self = <test_shutdown.Test testMethod=test_shutdown>
@unittest.skipIf(platform.platform().startswith("Windows-7"),
"Flaky test: https://github.com/elastic/beats/issues/22795")
def test_shutdown(self):
"""
Test starting and stopping Filebeat under load.
"""
self.nasa_logs()
self.render_config_template(
path=os.path.abspath(self.working_dir) + "/log/*",
ignore_older="1h"
)
for i in range(1, 5):
proc = self.start_beat(logging_args=["-e", "-v"])
time.sleep(.5)
> proc.check_kill_and_wait()
tests/system/test_shutdown.py:31:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../libbeat/tests/system/beat/beat.py:153: in check_kill_and_wait
return self.check_wait(exit_code=exit_code)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <beat.beat.Proc object at 0x105d7a9e0>, exit_code = 0
def check_wait(self, exit_code=0):
"""
check_wait waits for the process to exit, and checks the return code of the process
"""
actual_exit_code = self.wait()
> assert actual_exit_code == exit_code, f"Expected exit code to be {exit_code}, but it was {actual_exit_code}"
E AssertionError: Expected exit code to be 0, but it was -15
../libbeat/tests/system/beat/beat.py:133: AssertionError
- generated xml file: /Users/admin/builds/bk-agent-prod-orka-1715932705515043799/elastic/filebeat/filebeat/build/TEST-python-unit.xml -
============================= slowest 20 durations =============================
27.73s call filebeat/tests/system/test_crawler.py::Test::test_encodings
15.23s call filebeat/tests/system/test_harvester.py::Test::test_close_timeout
10.66s call filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_input
9.70s call filebeat/tests/system/test_stdin.py::Test::test_stdin_eof
9.20s call filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_with_space
8.49s call filebeat/tests/system/test_registrar.py::Test::test_restart_state
8.15s call filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl_no_clean_inactive
8.05s call filebeat/tests/system/test_registrar.py::Test::test_registry_file_default_permissions
7.30s call filebeat/tests/system/test_registrar.py::Test::test_rotating_file_with_restart
7.28s call filebeat/tests/system/test_registrar.py::Test::test_rotating_file_inode
7.03s call filebeat/tests/system/test_registrar.py::Test::test_restart_state_reset_ttl
6.95s call filebeat/tests/system/test_reload_inputs.py::Test::test_reload_add
6.91s call filebeat/tests/system/test_reload_inputs.py::Test::test_start_stop
6.87s call filebeat/tests/system/test_registrar.py::Test::test_state_after_rotation_ignore_older
6.82s call filebeat/tests/system/test_reload_inputs.py::Test::test_reload_same_config
6.54s call filebeat/tests/system/test_harvester.py::Test::test_decode_error
6.48s call filebeat/tests/system/test_crawler.py::Test::test_utf8
6.36s call filebeat/tests/system/test_input.py::Test::test_rotating_close_inactive_low_write_rate
6.27s call filebeat/tests/system/test_input.py::Test::test_harvester_limit
6.13s call filebeat/tests/system/test_input.py::Test::test_input_filter_dropfields
=========================== short test summary info ============================
FAILED tests/system/test_shutdown.py::Test::test_shutdown - AssertionError: Expected exit code to be 0, but it was -15
============ 1 failed, 189 passed, 175 skipped in 608.83s (0:10:08) ============
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Also seen in CI runs for merge commits against main e.g. in https://buildkite.com/elastic/filebeat/builds/6466
@pierrehilbert this is something that is affecting your CI metrics.
This test is related to this https://elastic.slack.com/archives/C047NDNGUMU/p1717091056290869. On MacOS, the FQDN lookup takes some time (~5 seconds) before signal handlers are even set up.
We can fix this behaviour in either of the following ways:
- Wait for a bit longer (~10 seconds) for MacOS for the
filebeatto start up. - Skip this test, as the underlying FQDN is itself flaky and throws warning.
cc: @pierrehilbert
Let's go with the option 1 and if this is not solving the issue, let's skip the test only on MacOS