openQA
openQA copied to clipboard
Verify when maintenance update has already been released
Before cloning or restarting a job, verify the maintenance update has already been released.
- Related ticket: https://progress.opensuse.org/issues/109707
The concept of SLE maintenance tests are not known to openQA and seems like you managed to fix a valid problem but the generic openQA clone job script is certainly not the right approach. Maybe think about a downstream script that can be used within the tooling
Thank you Oliver for the suggestion. Could you please suggest me where can we implement this change. I thought of adding this change at the beginning of openQA clone jobs to detect when maintenance updates have been released before downloading the assets for the execution.
Thank you Oliver for the suggestion. Could you please suggest me where can we implement this change.
I am thinking of three possible approaches:
- As openQA already provides a warning if a mandatory "asset" is missing likely the best approach would be to ensure that according repos are available on openQA itself so that a consistent test base is available for the lifetime of a test. As soon as that asset (e.g. a repo) is pruned from openQA by the automatic cleanup then any clone attempt would be prevented with an informative message about the unavailability of the asset. So that would mean there would be no need to do any changes in openQA itself but change how openQA syncs assets and triggers tests, likely in https://github.com/openSUSE/qem-bot/
- develop the SLE maintenance specific handling as custom "plugin" code for openqa CLI applications. That's likely the hardest but we already have a plugin structure for openQA CLI applications so in the end this might be the best way to go. What likely would be necessary is to rework openqa-clone-job to reuse what openqa-cli has in the backend and then implement custom commands in a separate repo together with the complete deployment structure, e.g. a separate openSUSE package that can be installed along with the package openQA-client
- Prepare a separate project where you have a new application that internally calls openqa-clone-job and on top does what you do here.
I thought of adding this change at the beginning of openQA clone jobs to detect when maintenance updates have been released before downloading the assets for the execution.
Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?
Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job. I will discuss with team and come up with a reasonable solution. Thanks Oliver for the feedback and suggestion :)
Also note that the ticket says "we should have a warning". The "die" is a little bit more than a warning.
Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?
Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job.
What in particular do you mean with "while verifying the test"?
Sure. Let me ask, why does anybody care about cloning openQA jobs for maintenance updates that have already been released in the first place?
Sometimes job fails when we restart or clone it while verifying the test, because the updates have already been released. So team decided to add a check before running a job.
What in particular do you mean with "while verifying the test"?
According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.
According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.
which wouldn't reproduce a consistent same result if the maintenance update was released meanwhile
According to my knowledge, we re-execute/clone the failed openQA jobs of particular build(may be 2 days ago build) to identify the failure reasons.
which wouldn't reproduce a consistent same result if the maintenance update was released meanwhile
Exactly this is what we want to avoid.
This is something we are facing in maintenance tests but also affects Tumbleweed when snapshot has been released. The idea of this PR is to reduce the time of failure from possible minutes to seconds by checking the availability of repository earlier and for time being we will start with maintenance then we will add same logic to other products. May be lets jump on call with @foursixnine to find the common solution.
One Idea is to add the changes in openqa-clone-custom-git-refspec.
One Idea is to add the changes in openqa-clone-custom-git-refspec.
Well, openqa-clone-custom-git-refspec is using openqa-clone-job and the problem would apply the same for openqa-clone-job so I don't see that as beneficial.
One Idea is to add the changes in openqa-clone-custom-git-refspec.
Well, openqa-clone-custom-git-refspec is using openqa-clone-job and the problem would apply the same for openqa-clone-job so I don't see that as beneficial.
Yeah, this is just one idea, however in the general sense the changes would need to go one way or the other into the clone job logic, since adding more wrappers or following any of the approaches you mentioned earlier will result in unnecessary complexity.
I do agree however, that it's not a generic problem, but it's a problem that both, SUSE and openSUSE face, and benefits mostly test developers.
@okurz now if you have an idea of how plugins for clone-job can be written, I'd appreciate direct hints on where to do it, so that other communities can also benefit too ;)
I do agree however, that it's not a generic problem, but it's a problem that both, SUSE and openSUSE face, and benefits mostly test developers.
By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update
By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update
Why not just gracefully walk over non-existant repos within the test code?
Let me go back one step. If you want …
- openQA to fail quickly or even prevent you from cloning a job that is missing a mandatory asset then make those assets known to openQA, see http://open.qa/docs/#_specifying_assets_required_by_a_job
- openSUSE/SLE tests to gracefully handle non-existant maintenance update repos then change https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/qam.pm#L76 to either catch zypper failures for non-existant repos or check for the existance of repos before adding. The second might be easier to implement but is not race-free.
By test developers I mean @os-autoinst/tests-maintainer and others who from time to time have to figure out why a job is failing, when the same job passed minutes earlier... only to figure out that it's due to a released snapshot or a released maintenance update
Why not just gracefully walk over non-existant repos within the test code?
That's a step further, since one might end up hitting the retry button on the webui :), however, it's already too late... specially when using instances where there are a lot of jobs already scheduled, I guess you know the drill
That's a step further, since one might end up hitting the retry button on the webui :)
sorry, I don't understand. What do you mean?
In the meantime I researched a bit of history.
During the work on https://progress.opensuse.org/issues/72121
- https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11122 has been developed along with a workflow description in https://confluence.suse.com/display/%7Eph03nix/Schedule+QEM+Test+Verification+runs
- https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11153 was extending the concept with not adding released or declined repositories
- https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11836 removed that (partially) as
- https://jira.suse.com/browse/OBS-84 ensured to have repos stay around "a little bit longer" but maybe not long enough
However https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/cleanup_qam_testrepos.pm#L36 is still around and looks like it could help a lot. Of course one needs to set explicit variables to control the behaviour but that can be considered a safer choice than accepting empty repos as that can alter the behaviour.
We were discussing in the QE tools team and I am sure we have found a good suggestion now. openQA already has the feature to check on asset existance and provide feedback to users, e.g. preventing a restart but allowing to force a restart when users still want and similar for calls to openqa-clone-job. But here obviously it is about external repo URLs which are not synced to openQA and do not need to be synced because each incident repository can be considered a consistent state at least as long as it exists. If you would like it to stay around for longer of course, same as we do for other assets, you could sync it to openQA. But let's start with making repository URLs probable to check if they exist and are accessible before starting the actual test. For this what one could do is following http://open.qa/docs/#_specifying_assets_required_by_a_job) extending the concept of URL treatment, e.g. introduce a special suffix _CHECK_URL and check each URL (not download it) in the setup phase and only start tests if probing is successful. The probing can not ensure that the repository stays around until the test does not rely on it anymore but your original approach was similar in that regard.
EDIT: After talking with foursixnine there is one thing I would like to add which is that we would like the URL probing to also happen on trigger time, so synchronously from openqa-cli/openqa-clone-job so that users would be informed immediately about unavailable URLs. (Here I would say it's ok if the URLs are probed from the client's machine, not the openQA workers even though this might differ in behaviour)
@DeepthiYV moreover on a general note: Members from the team are happy to help and I support that they also join you and others in more close collaboration where seen more efficient. So it's fine if you just want to research, dig deeper and try yourself. But at any point you can also ask us for something like a chat with rapid feedback time, pair-programming sessions or code walkthrough, e.g. in a video call.
https://progress.opensuse.org/issues/112349 about "Provide an option to always clone the latest version of the scenario run by the job" is related to what we suggested here.
Codecov Report
Merging #4693 (ef2ad78) into master (d04080d) will increase coverage by
0.00%
. The diff coverage is100.00%
.
:exclamation: Current head ef2ad78 differs from pull request most recent head dcfad6c. Consider uploading reports for the commit dcfad6c to get more accurate results
@@ Coverage Diff @@
## master #4693 +/- ##
=======================================
Coverage 98.08% 98.08%
=======================================
Files 375 377 +2
Lines 34808 34825 +17
=======================================
+ Hits 34142 34159 +17
Misses 666 666
Impacted Files | Coverage Δ | |
---|---|---|
lib/OpenQA/Script/CloneJob.pm | 97.15% <100.00%> (+0.03%) |
:arrow_up: |
lib/OpenQA/Script/CloneJobSUSE.pm | 100.00% <100.00%> (ø) |
|
t/35-script_clone_job_suse.t | 100.00% <100.00%> (ø) |
|
t/ui/01-list.t | 99.15% <0.00%> (-0.05%) |
:arrow_down: |
lib/OpenQA/WebAPI/Controller/Test.pm | 94.60% <0.00%> (-0.03%) |
:arrow_down: |
lib/OpenQA/Worker/WebUIConnection.pm | 94.55% <0.00%> (-0.03%) |
:arrow_down: |
t/config.t | 100.00% <0.00%> (ø) |
|
t/10-jobs.t | 100.00% <0.00%> (ø) |
|
lib/OpenQA/Setup.pm | 100.00% <0.00%> (ø) |
|
t/ui/23-audit-log.t | 100.00% <0.00%> (ø) |
|
... and 3 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
Hi @DeepthiYV, I saw that you made some minor changes. If you apply changes and it's not obvious which parts you addressed and which you still need help with to provide an explicit message stating what you updated since the last review state but also to state the open points. Then we could more easily help you with that as well. Please understand that only patching little things will likely not make the pull request merging. We need a design change. See https://github.com/os-autoinst/openQA/pull/4693#issuecomment-1154964288 . We must not have openQA care about SUSE SLE maintenance updates in particular, that would be bad design. There are several ways to address this.
Hi @DeepthiYV, I saw that you made some minor changes. If you apply changes and it's not obvious which parts you addressed and which you still need help with to provide an explicit message stating what you updated since the last review state but also to state the open points. Then we could more easily help you with that as well. Please understand that only patching little things will likely not make the pull request merging. We need a design change. See #4693 (comment) . We must not have openQA care about SUSE SLE maintenance updates in particular, that would be bad design. There are several ways to address this.
Thanks for the feedback. I think I’ll need your support for these changes . Can we have a short call to discuss this.
I think we should not accept SLE specific code into the main functionality flow of the client or clone-job. Also no bigger code blocks without automated tests. Going forward with a call to discuss details is one possibility to do. At the very least although not the best solution could be as simple as a separate file like lib/OpenQA/Script/CloneJobSUSE.pm that is loaded from CloneJob. I would very much prefer what I suggested in #4693 (comment). If you want to go ahead with that I suggest to look into the existing code that handles
_DECOMPRESS_URL
. That could be a good starting point to implement something like the_CHECK_URL
functionality I mentioned.
I will start working with the solution: Loading a separate file from Clone Job. Other suggested solution on the lines of "_DECOMPRESS_URL" looks complex.
ok, I strongly suggest using test driven development to do further implementation
ok, I strongly suggest using test driven development to do further implementation
I had done changes in separate file CloneJobSUSE.pm that is loaded from CloneJob with unit test. I verified the changes on few jobs of O3 and OSD. Kindly review and provide your feedback.
@DeepthiYV as soon as you have implemented https://github.com/os-autoinst/openQA/pull/4693#discussion_r924426323 and do not plan to do more changes please mark the PR as ready for review as it's still in draft state
@mimi1vx I think this is still pending your review
still same problem ... using *_TEST_TEMPLATE
and _TEST_ISSUE
variables ... its problem. Really used is only part of this because they aree used to compute MAINT_TEST_REPO
var using SCC_ADDONS
but only in aggregate jobs, for incidents there is INCIDENT_REPO
var
still same problem ... using
*_TEST_TEMPLATE
and_TEST_ISSUE
variables ... its problem. Really used is only part of this because they aree used to computeMAINT_TEST_REPO
var usingSCC_ADDONS
but only in aggregate jobs, for incidents there isINCIDENT_REPO
var @mimi1vx, I think, we can make this changes only after the implementation of the ticket https://progress.opensuse.org/issues/111710. @foursixnine and I were already discussed and we will start working on this ticket from next week.
@mimi1vx, I think, we can make this changes only after the implementation of the ticket https://progress.opensuse.org/issues/111710. @foursixnine and I were already discussed and we will start working on this ticket from next week.
no described problem is real now, even without poo#11710 and after implementation of this poo it will be even worse. by using correct vars:
- there isn't need to reconstruct URL from ISSUE + template because MAINT_TEST_REPO and INCIDENT_REPO contains this URL's
- you will be checking only valid incidents for given job, not all possible incidents
- it will be future-proof
current implementation is simply wrong
I discussed this with @mimi1vx . I liked his suggestion. I shall first implement the ticket 111710 and then I will make changes to this PR.
@mimi1vx Can you please review my latest changes.