scylla-cluster-tests icon indicating copy to clipboard operation
scylla-cluster-tests copied to clipboard

feature(artifact_test): testing the output of perftune.py

Open ShlomiBalalis opened this issue 2 years ago • 15 comments

perftune.py is a script within seastar (which is a part of scylla_setup) that configures several system parameters, like the amount of CPU cores that are designated to handle IRQ, and which are not.

I have recorded the expected results as of master 20230717.567b4536892f in defaults/perftune_results.json, and the test will compare them to the output of the script.

https://github.com/scylladb/scylla-enterprise/issues/2909

PR pre-checks (self review)

  • [x] I followed KISS principle and best practices
  • [x] I didn't leave commented-out/debugging code
  • [x] I added the relevant backport labels
  • [ ] New configuration option are added and documented (in sdcm/sct_config.py)
  • [ ] I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • [ ] All new and existing unit tests passed (CI)
  • [ ] I have updated the Readme/doc folder accordingly (if needed)

ShlomiBalalis avatar May 28 '23 17:05 ShlomiBalalis

Job I use to test the feature: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/artifacts-ami-test_perftune/ So far, the failure is due to mismatch with the expected result written in the doc to the result in practice. @vladzcloudius Can you please check my comments?

ShlomiBalalis avatar May 30 '23 12:05 ShlomiBalalis

Job I use to test the feature: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/artifacts-ami-test_perftune/ So far, the failure is due to mismatch with the expected result written in the doc to the result in practice. @vladzcloudius Can you please check my comments?

@ShlomiBalalis was it related to what we discussed on GoogleChat with you and @roydahan or was it something different?

Please, let me know if you still want me to take a look at anything at the moment?

vladzcloudius avatar Jul 20 '23 03:07 vladzcloudius

Job I use to test the feature: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/artifacts-ami-test_perftune/ So far, the failure is due to mismatch with the expected result written in the doc to the result in practice. @vladzcloudius Can you please check my comments?

@ShlomiBalalis was it related to what we discussed on GoogleChat with you and @roydahan or was it something different?

Please, let me know if you still want me to take a look at anything at the moment?

No, it's a very old comment on a very old PR. I will update the PR in a moment

ShlomiBalalis avatar Jul 24 '23 14:07 ShlomiBalalis

@roydahan ping

ShlomiBalalis avatar Jul 26 '23 09:07 ShlomiBalalis

In https://github.com/scylladb/scylla-pkg/pull/3500, new instance types were added to AWS and GCE. In the latest push, I added the results of perftune on said machines

ShlomiBalalis avatar Jul 26 '23 17:07 ShlomiBalalis

Please check the failing checks. other than that, looks good to me.

This is the only failure. Doesn't seem to be related to my PR

[2023-07-26T17:41:33.825Z] _______________________ TestFileLogger.test_file_logger ________________________
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z] self = <unit_tests.test_sct_events_file_logger.TestFileLogger testMethod=test_file_logger>
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z]     def tearDown(self) -> None:
[2023-07-26T17:41:33.825Z] >       self.file_logger.stop(timeout=1)
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z] unit_tests/test_sct_events_file_logger.py:41: 
[2023-07-26T17:41:33.825Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z] self = <EventsFileLogger name='EventsFileLogger-117' pid=3104 parent=10 stopped exitcode=0 daemon>
[2023-07-26T17:41:33.825Z] timeout = 1
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z]     def stop(self, timeout: float = None) -> None:
[2023-07-26T17:41:33.825Z]         events_process_class_name = self.__class__.__name__
[2023-07-26T17:41:33.825Z]         LOGGER.debug("Stopping events process %s", events_process_class_name)
[2023-07-26T17:41:33.825Z]         self.terminate()
[2023-07-26T17:41:33.825Z]         LOGGER.debug("Waiting for events process %s to stop", events_process_class_name)
[2023-07-26T17:41:33.825Z]         self.join(timeout)  # pylint: disable=no-member; both threading.Thread and multiprocessing.Process have it
[2023-07-26T17:41:33.825Z]         if self.is_alive():  # pylint: disable=no-member;
[2023-07-26T17:41:33.825Z]             LOGGER.error("Events process %s is still alive after timeout", events_process_class_name)
[2023-07-26T17:41:33.825Z] >           assert False, f"Events process {events_process_class_name} is still alive after timeout"
[2023-07-26T17:41:33.825Z] E           AssertionError: Events process EventsFileLogger is still alive after timeout
[2023-07-26T17:41:33.825Z] 
[2023-07-26T17:41:33.825Z] sdcm/sct_events/events_processes.py:104: AssertionError

ShlomiBalalis avatar Jul 27 '23 14:07 ShlomiBalalis

@vladzcloudius , could you help us here re-reviewing this PR?

fgelcer avatar Jul 30 '23 10:07 fgelcer

@vladzcloudius already approved that the results on master can be treated as the expected results. I don't think we should wait more with this.

roydahan avatar Jul 31 '23 18:07 roydahan

@vladzcloudius already approved that the results on master can be treated as the expected results. I don't think we should wait more with this.

My apologies, guys. Too much fires to put down at the same time. Yes, please, move forward for now. I'll review later and if issues found we can send a follow up fix.

vladzcloudius avatar Jul 31 '23 23:07 vladzcloudius

...and there is a comment already. ;)

vladzcloudius avatar Aug 01 '23 01:08 vladzcloudius

This PR is just getting rotten here. @ShlomiBalalis is not available.

We need someone to finalize the work. If we don't do it, I'm going to merge it and we will handle the failures as part of master.

roydahan avatar Nov 22 '23 10:11 roydahan

This PR is just getting rotten here. @ShlomiBalalis is not available.

We need someone to finalize the work. If we don't do it, I'm going to merge it and we will handle the failures as part of master.

good luck with that

fruch avatar Nov 22 '23 12:11 fruch

reminder to self: I ran https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/2023.1-artifacts-ami-arm-test/4/ and all tests passed. Need to go over the log and make sure it did what we expected it to do.

roydahan avatar Dec 18 '23 20:12 roydahan

@ShlomiBalalis please try to get back to it, so we will merge it.

roydahan avatar Jan 18 '24 14:01 roydahan

@ShlomiBalalis please return to it and update these tests so we can merge them.

roydahan avatar Jul 24 '24 10:07 roydahan

rebased it, and giving a quick go in few jobs:

  • [ ] 🔴 https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/artifacts-docker-test/29/
  • [ ] 🔴 https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/artifacts-ubuntu2204-test/62/
  • [ ] 🔴 https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/artifacts-ubuntu2404-test/11/

fruch avatar Sep 19 '24 19:09 fruch

seems like it's currently broken and fail on some permission errors regardless CI isn't working on this PR, since the owner isn't in scylla anymore.

opening a new one: https://github.com/scylladb/scylla-cluster-tests/pull/8788

fruch avatar Sep 19 '24 19:09 fruch