KAFKA-6332: Kafka system tests should use nc instead of log grep to d…
…etect start-up
- Extracted a new function (listening) in system test utils module to test whether a specified port on a specified node is listening for connections.
- Refactored system tests to use the new function to test whether a particular server is started / listening on a port (instead of grepping for lines in server logs).
- Specified kdc.port for the MiniKdc server in minikdc.properties so that the server does not listen for connections on a "random" port.
- Fixed a typo in the documentation of the node_is_reachable function in utils/util.py
Testing done:
- Executed the following system tests: sanity_checks tests, simple_consumer_shell_test.py, consumer_group_command_test.py, trogdor_test.py, zookeeper_security_upgrade_test.py
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@ewencp you had originally asked for this, does it look good to you?
LGTM.
It's good that you assigned MiniKDC a defined port as well.
Many thanks for reviewing and approving the changes. Please let me know if there is anything else I need to further investigate / fix related to this task.
@asasvari I was trying to get a clean system test build, but haven't been able to so far. Here's the report from a recent run: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2018-02-23--001.1519386952--asasvari--KAFKA-6332--b40a9f3/report.html Some of them look unrelated (e.g. some streams failures), but others have messages about not seeing services start up which seems potentially related to this PR. Recently trunk nightlies don't seem to have nearly as many issues, e.g. http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2018-02-27--001.1519737745--apache--trunk--f26fbb9/report.html, though this was started a bit ago so maybe tests were previously failing. Maybe try merging/rebasing and I can kick off another test run? And in the meantime take a look at the logs from the suspicious failures and see if there are any obvious problems?
@ewencp thanks for sharing the results, I rebased and pushed changes. I will have a closer look on the failures and try them out locally soon.
Managed to reproduce the issue: tests/docker/ducker-ak test ./tests/kafkatest/tests/core/mirror_maker_test.py failed locally too.
I found a way to fix the issue. We can make changes in kafka.py:
- use the
jmx_portof the broker for health checks. In effect, all MirrorMaker tests have passed. It was not recommended, however, it looks like it works at least in these tests. @ewencp, what do you think?
Also tried:
- added an extra check to the open_port(self, protocol) function to set self.broker_port only once, in the first function call.
if self.broker_port == -1:
self.broker_port = self.port_mappings[protocol].number
First 3 testcases have passed, but then the remaining tests started to fail (from kafkatest.tests.core.mirror_maker_test.TestMirrorMakerService.test_bounce.clean_shutdown=True.security_protocol=SASL_SSL.new_consumer=True).
More information
Using the port created by self.open_port(self.interbroker_security_protocol) is not working for some reason. Looking at the logs (TestMirrorMakerService/test_bounce/clean_shutdown=False.security_protocol=SASL_PLAINTEXT.new_consumer=True//1/test_log.debug):
[DEBUG - 2018-02-28 13:46:52,673 - remoteaccount - _log - lineno:160]: ducker@ducker04: Running ssh command: nc -z ducker04 9094
[DEBUG - 2018-02-28 13:46:52,743 - util - listening - lineno:132]: Server does not accept connections at: 'ducker04:9094')
...
[DEBUG - 2018-02-28 13:47:22,627 - util - listening - lineno:132]: Server does not accept connections at: 'ducker04:9094')
[INFO - 2018-02-28 13:47:22,887 - runner_client - log - lineno:230]: RunnerClient: kafkatest.tests.core.mirror_maker_test.TestMirrorMakerService.test_bounce.clean_shutdown=False.security_protocol=SASL_PLAINTEXT.new_consumer=True: FAIL: Kafka server didn't finish startup
@asasvari The problem with using the jmx_port is that it may be opened significantly before the port(s) that Kafka listens on just due to the order in which things run. Using the jmx_port may appear to work, but would result in a flaky test if the Kafka server was slow to start up between JMX starting and the broker opening the port for listening.
It sounds like we may need to handle the case with multiple ports better. I'm not sure about the specific example you gave with more information, but perhaps there is some reason it doesn't open that port? We'd probably need to take a look at a specific example that's failing, understand what the test does, and what the expected behavior for open ports would be.
In general, I would actually prefer to check all the ports we expect to be open rather than just one, as that ensures we've fully completed startup.
retest this please
@ewencp anything preventing this from being merged?
@ijuma not really, it has just been sitting in my review backlog until I find time to revisit the final changes
@ewencp Feel free to take it over.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.