
Audit test pipelines

Open cachedout opened this issue 2 years ago • 18 comments

This is a master tracking issue for auditing which E2E test pipelines need to remain enabled.

Beats CI pipelines

| Pipeline | Main Health | Triggers | Stakeholders | Issue(s) | Removal planned |
| --- | --- | --- | --- | --- | --- |
| Docker images | ⭕ Stale | None | Robots [@cachedout and @kuisathaverat] | | |
| Fleet E2E | 🔴 Broken | Daily build | Fleet [@joshdover] | https://github.com/elastic/elastic-agent/issues/1174 | |
| Observability Helm Charts | 🟢 Healthy | Daily build | Robots [@cachedout and @kuisathaverat] | Issue located in private repo | |
| K8S Autodiscover | 🟡 Flaky | Daily build | Cloud Native Monitoring [@gizas] | | |
| Observability MacOS | 🔴 Broken | Daily build | Elastic Agent [@cmacknz and @jlind23] | https://github.com/elastic/ci/issues/705 | |
| Fleet Server | ⭕ Stale | None | Fleet [@joshdover] | https://github.com/elastic/fleet-server/issues/1927 | |
| Fleet UI | ⭕ Stale | None | Fleet and Integrations [@kpollich] | | |

Fleet CI pipelines

| Pipeline | Main Health | Triggers | Stakeholders | Issue |
| --- | --- | --- | --- | --- |
| Pipeline helper | 🔴 Broken | Push to main; PR labeled | Elastic Agent [@cmacknz and @jlind23] | https://github.com/elastic/elastic-agent/issues/1174 |

⚠️ If you are listed as a stakeholder, we would like to know the following:

  1. Should the pipeline be removed from the CI or should it remain?
     1. If the pipeline remains and is broken, what is the link to an issue tracking a fix?
     2. If the pipeline should remain, how is it monitored by the team to ensure that build artifacts are not produced when the tests fail?

Next steps

Proposed pipeline criteria

I am proposing that we remove all pipelines which do not meet any of the following criteria:

  1. Necessary for the ongoing health of the E2E test suite itself
  2. Used by a product team as a quality gateway. Concretely, this means that a failing test blocks a PR from being merged or a build artifact from being produced.
  3. Exist to ensure the quality of a supported product.

Timeline

  1. All existing E2E pipelines have stakeholders assigned no later than: October 1, 2022
  2. All stakeholders agree upon the proposed pipeline criteria no later than: October 20, 2022
  3. Non-conforming pipelines will be removed from Jenkins and their code will be removed from the E2E test suite beginning on: Nov 1st, 2022

Related efforts

There is a separate effort to try and reduce the scope of E2E testing back to a point where stability can be maintained, but it is limited to tests for the Agent. That effort can be found here: https://github.com/elastic/elastic-agent/issues/1174

cachedout avatar Sep 28 '22 09:09 cachedout

For the MacOS Daily pipeline: it was originally implemented in https://github.com/elastic/e2e-testing/pull/2626 using the Orka ephemeral workers, and it superseded https://github.com/elastic/e2e-testing/pull/2336.

The error is something the @elastic/ci-systems team might need to help with:

```
[2022-09-28T04:58:54.763Z] + .ci/scripts/deployment.sh create
[2022-09-28T04:58:54.887Z] Cloning into '.obs'...
[2022-09-28T04:58:55.095Z] Host key verification failed.
[2022-09-28T04:58:55.095Z] fatal: Could not read from remote repository.
[2022-09-28T04:58:55.095Z]
[2022-09-28T04:58:55.095Z] Please make sure you have the correct access rights
[2022-09-28T04:58:55.095Z] and the repository exists.
```

IIUC, the recent upgrade to the CI controllers added host key verification by default. We reported this in the past and it was partially fixed, since we no longer see the error below but a new one:

(screenshot of the new error)

but the error now happens in a subsequent stage to clone a private repository -- see the above console log
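A common remediation for this class of failure is to pre-populate the worker's `known_hosts` before the clone step, so that SSH host key verification succeeds non-interactively. A minimal sketch follows; the repository name and paths are assumptions for illustration, not taken from the actual pipeline:

```shell
# Add the Git host's public keys to known_hosts ahead of the clone,
# so strict host key checking passes without a prompt.
mkdir -p ~/.ssh
ssh-keyscan -t rsa,ecdsa,ed25519 github.com >> ~/.ssh/known_hosts

# Alternatively, scope the setting to git itself (OpenSSH >= 7.6):
# accept-new trusts a host on first contact but still rejects changed keys.
export GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=accept-new'

# Hypothetical clone of the private repo into .obs, as the failing stage does.
git clone git@github.com:elastic/some-private-repo.git .obs
```

Either approach is a workaround at the job level; the underlying fix would be for the CI controller images to ship with the relevant host keys baked in.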

It worked in the past

(screenshot of a previously passing build)

v1v avatar Sep 28 '22 10:09 v1v

The Docker images pipeline generated the systemd Docker images used in the E2E tests; we are probably the stakeholders.

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@v1v Thanks, that helps. I'm also trying to figure out what it actually does so that I can figure out who the stakeholders should be. I'm code-diving a bit right now to try to get a sense of that.

cachedout avatar Sep 28 '22 10:09 cachedout

Observability Helm Charts can be removed

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@kuisathaverat Thanks! Regarding the Docker images -- that pipeline hasn't been executed for over a year. Does it still need to exist?

cachedout avatar Sep 28 '22 10:09 cachedout

> Does it still need to exist?

It is the only way to generate those images, so it should be executed when they change. These images are used to test installation in a systemd environment. The main changes they can have are bumping the systemd version or the Linux version.
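For context, exercising a systemd-based installation test typically means running the container with an init process as PID 1 and cgroup access. The sketch below is hypothetical (image name, flags, and service are assumptions, not the pipeline's actual commands):

```shell
# Build the systemd-capable test image from a hypothetical Dockerfile.
docker build -t e2e-systemd-test .

# systemd must run as PID 1 and needs the host's cgroup hierarchy
# (Docker 20.10+ for --cgroupns).
docker run -d --name systemd-test \
  --privileged \
  --cgroupns=host \
  -v /sys/fs/cgroup:/sys/fs/cgroup:rw \
  e2e-systemd-test /sbin/init

# Then drive the package installation test inside the container.
docker exec systemd-test systemctl is-system-running --wait
```

This is why the images only need regenerating when the systemd or Linux version is bumped: the test logic lives in the suite, not in the image.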

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@cmacknz and @jlind23 Are you tracking any issues for the flakiness in the K8s Autodiscover pipeline?

cachedout avatar Sep 28 '22 11:09 cachedout

> @v1v Thanks, that helps. I'm also trying to figure out what it actually does so that I can figure out who the stakeholders should be. I'm code-diving a bit right now to try to get a sense of that.

There was an original request to test on MacOS. It was initially attempted with AWS MacOS instances, but that approach was declined for various reasons:

  1. Cost: IIRC, machines would be created and paid for a 24-hour minimum, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1111883732
  2. Implementation: the Ansible EC2 integration didn't work well, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1118315327
  3. Ephemeral Orka workers became available, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1147931612

I guess the stakeholder might be @jlind23, as he was the original requester for MacOS testing in AWS.

v1v avatar Sep 28 '22 11:09 v1v

@cachedout this is the issue we will use for the first half of 8.6. @AndersonQ is already assigned to it and will work closely with you in order to get back to a better place.

jlind23 avatar Sep 28 '22 11:09 jlind23

@jlind23 That link seems wrong? :)

cachedout avatar Sep 28 '22 11:09 cachedout

Sorry, this one - https://github.com/elastic/elastic-agent/issues/1174

jlind23 avatar Sep 28 '22 12:09 jlind23

> @cmacknz and @jlind23 Are you tracking any issues for the flakiness in the K8s Autodiscover pipeline?

No, it may make sense to follow up with the Observability Cloudnative monitoring team to see if they have interest in fixing these tests faster than the agent team can get to them. They have done the majority of the recent work for autodiscovery features in agent.

cmacknz avatar Sep 28 '22 18:09 cmacknz

> No, it may make sense to follow up with the Observability Cloudnative monitoring team to see if they have interest in fixing these tests faster than the agent team can get to them.

Looping in @gizas. We are trying to stabilize the E2E test suite. Are you aware of the flakiness in the K8s autodiscover tests, and if so, is anybody on your team investigating them?

cachedout avatar Sep 29 '22 11:09 cachedout

I have disabled most of the tests in the Fleet E2E suite while we evaluate what to do with the remaining ones: https://github.com/elastic/elastic-agent/issues/1174#issuecomment-1267023078

joshdover avatar Oct 04 '22 13:10 joshdover

Sorry for the delayed answer, @cachedout, @cmacknz. I'm just checking the K8s Autodiscover pipeline now. Can you point me to a failing run to have a look?

Indeed, we have provided some fixes in the past.

gizas avatar Oct 05 '22 11:10 gizas

> Observability Helm Charts can be removed

Any reason this hasn't been done yet? I've seen it fail on a few PR runs recently and couldn't find the issue tracking its removal.

joshdover avatar Oct 17 '22 11:10 joshdover

> Any reason this hasn't been done yet?

Hi @joshdover . The issue is this one: https://github.com/elastic/observability-robots/issues/1325

We were considering this blocked until they sorted out the public communication regarding chart deprecation, but TBH it's probably not a big deal to just pull it out now if it's failing in PRs. LMK what you think.

cachedout avatar Oct 17 '22 12:10 cachedout

Makes sense. I've only seen it fail once recently, but will flag it if it's more of a problem.


joshdover avatar Oct 17 '22 12:10 joshdover