Before cloning or restarting a job, verify the maintenance update has already been released.

Related ticket: https://progress.opensuse.org/issues/109707

Jun 08 '22 11:06 DeepthiYV

The concept of SLE maintenance tests are not known to openQA and seems like you managed to fix a valid problem but the generic openQA clone job script is certainly not the right approach. Maybe think about a downstream script that can be used within the tooling

Thank you Oliver for the suggestion. Could you please suggest me where can we implement this change. I thought of adding this change at the beginning of openQA clone jobs to detect when maintenance updates have been released before downloading the assets for the execution.

Jun 08 '22 13:06 DeepthiYV

Thank you Oliver for the suggestion. Could you please suggest me where can we implement this change.

I am thinking of three possible approaches:

As openQA already provides a warning if a mandatory "asset" is missing likely the best approach would be to ensure that according repos are available on openQA itself so that a consistent test base is available for the lifetime of a test. As soon as that asset (e.g. a repo) is pruned from openQA by the automatic cleanup then any clone attempt would be prevented with an informative message about the unavailability of the asset. So that would mean there would be no need to do any changes in openQA itself but change how openQA syncs assets and triggers tests, likely in https://github.com/openSUSE/qem-bot/

develop the SLE maintenance specific handling as custom "plugin" code for openqa CLI applications. That's likely the hardest but we already have a plugin structure for openQA CLI applications so in the end this might be the best way to go. What likely would be necessary is to rework openqa-clone-job to reuse what openqa-cli has in the backend and then implement custom commands in a separate repo together with the complete deployment structure, e.g. a separate openSUSE package that can be installed along with the package openQA-client

Prepare a separate project where you have a new application that internally calls openqa-clone-job and on top does what you do here.

I thought of adding this change at the beginning of openQA clone jobs to detect when maintenance updates have been released before downloading the assets for the execution.

Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?

Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job. I will discuss with team and come up with a reasonable solution. Thanks Oliver for the feedback and suggestion :)

Jun 09 '22 06:06 DeepthiYV

Also note that the ticket says "we should have a warning". The "die" is a little bit more than a warning.

Jun 09 '22 11:06 Martchus

Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?

Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job.

What in particular do you mean with "while verifying the test"?

Jun 09 '22 13:06 okurz

Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?

Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job.

What in particular do you mean with "while verifying the test"?

According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.

Jun 09 '22 13:06 DeepthiYV

According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.

which wouldn't reproduce a consistent same result if the maintenance update was released meanwhile

Jun 09 '22 14:06 okurz

According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.

which wouldn't reproduce a consistent same result if the maintenance update was released meanwhile

Exactly this is what we want to avoid.

This is something we are facing in maintenance tests but also affects Tumbleweed when snapshot has been released. The idea of this PR is to reduce the time of failure from possible minutes to seconds by checking the availability of repository earlier and for time being we will start with maintenance then we will add same logic to other products. May be lets jump on call with @foursixnine to find the common solution.

One Idea is to add the changes in openqa-clone-custom-git-refspec.

Jun 09 '22 14:06 DeepthiYV

One Idea is to add the changes in openqa-clone-custom-git-refspec.

Well, openqa-clone-custom-git-refspec is using openqa-clone-job and the problem would apply the same for openqa-clone-job so I don't see that as beneficial.

Jun 09 '22 14:06 okurz

One Idea is to add the changes in openqa-clone-custom-git-refspec.

Well, openqa-clone-custom-git-refspec is using openqa-clone-job and the problem would apply the same for openqa-clone-job so I don't see that as beneficial.

Yeah, this is just one idea, however in the general sense the changes would need to go one way or the other into the clone job logic, since adding more wrappers or following any of the approaches you mentioned earlier will result in unnecessary complexity.

I do agree however, that it's not a generic problem, but it's a problem that both, SUSE and openSUSE face, and benefits mostly test developers.

@okurz now if you have an idea of how plugins for clone-job can be written, I'd appreciate direct hints on where to do it, so that other communities can also benefit too ;)

Jun 09 '22 14:06 foursixnine

I do agree however, that it's not a generic problem, but it's a problem that both, SUSE and openSUSE face, and benefits mostly test developers.

By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update

Jun 09 '22 14:06 foursixnine

By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update

Why not just gracefully walk over non-existant repos within the test code?

Jun 09 '22 14:06 okurz

Let me go back one step. If you want …

openQA to fail quickly or even prevent you from cloning a job that is missing a mandatory asset then make those assets known to openQA, see http://open.qa/docs/#_specifying_assets_required_by_a_job
openSUSE/SLE tests to gracefully handle non-existant maintenance update repos then change https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/qam.pm#L76 to either catch zypper failures for non-existant repos or check for the existance of repos before adding. The second might be easier to implement but is not race-free.

Jun 09 '22 15:06 okurz

By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update

Why not just gracefully walk over non-existant repos within the test code?

That's a step further, since one might end up hitting the retry button on the webui :), however, it's already too late... specially when using instances where there are a lot of jobs already scheduled, I guess you know the drill

Jun 09 '22 15:06 foursixnine

That's a step further, since one might end up hitting the retry button on the webui :)

sorry, I don't understand. What do you mean?

In the meantime I researched a bit of history.

During the work on https://progress.opensuse.org/issues/72121

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11122 has been developed along with a workflow description in https://confluence.suse.com/display/%7Eph03nix/Schedule+QEM+Test+Verification+runs
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11153 was extending the concept with not adding released or declined repositories
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11836 removed that (partially) as
https://jira.suse.com/browse/OBS-84 ensured to have repos stay around "a little bit longer" but maybe not long enough

However https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/cleanup_qam_testrepos.pm#L36 is still around and looks like it could help a lot. Of course one needs to set explicit variables to control the behaviour but that can be considered a safer choice than accepting empty repos as that can alter the behaviour.

Jun 09 '22 15:06 okurz

We were discussing in the QE tools team and I am sure we have found a good suggestion now. openQA already has the feature to check on asset existance and provide feedback to users, e.g. preventing a restart but allowing to force a restart when users still want and similar for calls to openqa-clone-job. But here obviously it is about external repo URLs which are not synced to openQA and do not need to be synced because each incident repository can be considered a consistent state at least as long as it exists. If you would like it to stay around for longer of course, same as we do for other assets, you could sync it to openQA. But let's start with making repository URLs probable to check if they exist and are accessible before starting the actual test. For this what one could do is following http://open.qa/docs/#_specifying_assets_required_by_a_job) extending the concept of URL treatment, e.g. introduce a special suffix _CHECK_URL and check each URL (not download it) in the setup phase and only start tests if probing is successful. The probing can not ensure that the repository stays around until the test does not rely on it anymore but your original approach was similar in that regard.

EDIT: After talking with foursixnine there is one thing I would like to add which is that we would like the URL probing to also happen on trigger time, so synchronously from openqa-cli/openqa-clone-job so that users would be informed immediately about unavailable URLs. (Here I would say it's ok if the URLs are probed from the client's machine, not the openQA workers even though this might differ in behaviour)

Jun 14 '22 09:06 okurz

@DeepthiYV moreover on a general note: Members from the team are happy to help and I support that they also join you and others in more close collaboration where seen more efficient. So it's fine if you just want to research, dig deeper and try yourself. But at any point you can also ask us for something like a chat with rapid feedback time, pair-programming sessions or code walkthrough, e.g. in a video call.

Jun 14 '22 19:06 okurz

https://progress.opensuse.org/issues/112349 about "Provide an option to always clone the latest version of the scenario run by the job" is related to what we suggested here.

Jun 14 '22 19:06 okurz

Codecov Report

Merging #4693 (ef2ad78) into master (d04080d) will increase coverage by 0.00%. The diff coverage is 100.00%.

:exclamation: Current head ef2ad78 differs from pull request most recent head dcfad6c. Consider uploading reports for the commit dcfad6c to get more accurate results

@@           Coverage Diff           @@
##           master    #4693   +/-   ##
=======================================
  Coverage   98.08%   98.08%           
=======================================
  Files         375      377    +2     
  Lines       34808    34825   +17     
=======================================
+ Hits        34142    34159   +17     
  Misses        666      666

Impacted Files	Coverage Δ
lib/OpenQA/Script/CloneJob.pm	`97.15% <100.00%> (+0.03%)`	:arrow_up:
lib/OpenQA/Script/CloneJobSUSE.pm	`100.00% <100.00%> (ø)`
t/35-script_clone_job_suse.t	`100.00% <100.00%> (ø)`
t/ui/01-list.t	`99.15% <0.00%> (-0.05%)`	:arrow_down:
lib/OpenQA/WebAPI/Controller/Test.pm	`94.60% <0.00%> (-0.03%)`	:arrow_down:
lib/OpenQA/Worker/WebUIConnection.pm	`94.55% <0.00%> (-0.03%)`	:arrow_down:
t/config.t	`100.00% <0.00%> (ø)`
t/10-jobs.t	`100.00% <0.00%> (ø)`
lib/OpenQA/Setup.pm	`100.00% <0.00%> (ø)`
t/ui/23-audit-log.t	`100.00% <0.00%> (ø)`
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Jun 21 '22 10:06 codecov[bot]

Hi @DeepthiYV, I saw that you made some minor changes. If you apply changes and it's not obvious which parts you addressed and which you still need help with to provide an explicit message stating what you updated since the last review state but also to state the open points. Then we could more easily help you with that as well. Please understand that only patching little things will likely not make the pull request merging. We need a design change. See https://github.com/os-autoinst/openQA/pull/4693#issuecomment-1154964288 . We must not have openQA care about SUSE SLE maintenance updates in particular, that would be bad design. There are several ways to address this.

Jun 22 '22 04:06 okurz

Hi @DeepthiYV, I saw that you made some minor changes. If you apply changes and it's not obvious which parts you addressed and which you still need help with to provide an explicit message stating what you updated since the last review state but also to state the open points. Then we could more easily help you with that as well. Please understand that only patching little things will likely not make the pull request merging. We need a design change. See #4693 (comment) . We must not have openQA care about SUSE SLE maintenance updates in particular, that would be bad design. There are several ways to address this.

Thanks for the feedback. I think I’ll need your support for these changes . Can we have a short call to discuss this.

Jun 22 '22 06:06 DeepthiYV

I think we should not accept SLE specific code into the main functionality flow of the client or clone-job. Also no bigger code blocks without automated tests. Going forward with a call to discuss details is one possibility to do. At the very least although not the best solution could be as simple as a separate file like lib/OpenQA/Script/CloneJobSUSE.pm that is loaded from CloneJob. I would very much prefer what I suggested in #4693 (comment). If you want to go ahead with that I suggest to look into the existing code that handles _DECOMPRESS_URL. That could be a good starting point to implement something like the _CHECK_URL functionality I mentioned.

I will start working with the solution: Loading a separate file from Clone Job. Other suggested solution on the lines of "_DECOMPRESS_URL" looks complex.

Jun 22 '22 14:06 DeepthiYV

ok, I strongly suggest using test driven development to do further implementation

Jun 22 '22 14:06 okurz

ok, I strongly suggest using test driven development to do further implementation

I had done changes in separate file CloneJobSUSE.pm that is loaded from CloneJob with unit test. I verified the changes on few jobs of O3 and OSD. Kindly review and provide your feedback.

Jul 13 '22 14:07 DeepthiYV

@DeepthiYV as soon as you have implemented https://github.com/os-autoinst/openQA/pull/4693#discussion_r924426323 and do not plan to do more changes please mark the PR as ready for review as it's still in draft state

Jul 19 '22 12:07 okurz

@mimi1vx I think this is still pending your review

Jul 27 '22 08:07 kalikiana

still same problem ... using *_TEST_TEMPLATE and _TEST_ISSUE variables ... its problem. Really used is only part of this because they aree used to compute MAINT_TEST_REPO var using SCC_ADDONS but only in aggregate jobs, for incidents there is INCIDENT_REPO var

Jul 27 '22 08:07 mimi1vx

still same problem ... using *_TEST_TEMPLATE and _TEST_ISSUE variables ... its problem. Really used is only part of this because they aree used to compute MAINT_TEST_REPO var using SCC_ADDONS but only in aggregate jobs, for incidents there is INCIDENT_REPO var @mimi1vx, I think, we can make this changes only after the implementation of the ticket https://progress.opensuse.org/issues/111710. @foursixnine and I were already discussed and we will start working on this ticket from next week.

Jul 27 '22 09:07 DeepthiYV

@mimi1vx, I think, we can make this changes only after the implementation of the ticket https://progress.opensuse.org/issues/111710. @foursixnine and I were already discussed and we will start working on this ticket from next week.

no described problem is real now, even without poo#11710 and after implementation of this poo it will be even worse. by using correct vars:

there isn't need to reconstruct URL from ISSUE + template because MAINT_TEST_REPO and INCIDENT_REPO contains this URL's
you will be checking only valid incidents for given job, not all possible incidents
it will be future-proof

current implementation is simply wrong

Jul 27 '22 09:07 mimi1vx

I discussed this with @mimi1vx . I liked his suggestion. I shall first implement the ticket 111710 and then I will make changes to this PR.

Aug 01 '22 11:08 DeepthiYV

@mimi1vx Can you please review my latest changes.

Aug 12 '22 09:08 DeepthiYV

openQA
openQA copied to clipboard

Verify when maintenance update has already been released

Codecov Report

openQA openQA copied to clipboard

Verify when maintenance update has already been released

Codecov Report

openQA
openQA copied to clipboard