
Audit test pipelines

Open cachedout opened this issue 2 years ago • 18 comments

This is a master tracking issue for auditing which E2E test pipelines need to remain enabled.

Beats CI pipelines

| Pipeline | Main Health | Triggers | Stakeholders | Issue(s) | Removal planned |
| --- | --- | --- | --- | --- | --- |
| Docker images | ⭕ Stale | None | Robots [@cachedout and @kuisathaverat] | | |
| Fleet E2E | 🔴 Broken | Daily build | Fleet [@joshdover] | https://github.com/elastic/elastic-agent/issues/1174 | |
| Observability Helm Charts | 🟢 Healthy | Daily build | Robots [@cachedout and @kuisathaverat] | Issue located in private repo | |
| K8S Autodiscover | 🟡 Flaky | Daily build | Cloud Native Monitoring [@gizas] | | |
| Observability MacOS | 🔴 Broken | Daily build | Elastic Agent [@cmacknz and @jlind23] | https://github.com/elastic/ci/issues/705 | |
| Fleet Server | ⭕ Stale | None | Fleet [@joshdover] | https://github.com/elastic/fleet-server/issues/1927 | |
| Fleet UI | ⭕ Stale | None | Fleet and Integrations [@kpollich] | | |

Fleet CI pipelines

| Pipeline | Main Health | Triggers | Stakeholders | Issue |
| --- | --- | --- | --- | --- |
| Pipeline helper | 🔴 Broken | Push to main; PR labeled | Elastic Agent [@cmacknz and @jlind23] | https://github.com/elastic/elastic-agent/issues/1174 |

⚠️ If you are listed as a stakeholder, we would like to know the following:

  1. Should the pipeline be removed from the CI or should it remain?
     1. If the pipeline remains and is broken, what is the link to an issue tracking a fix?
     2. If the pipeline should remain, how is it monitored by the team to ensure that build artifacts are not produced when the tests fail?

Next steps

Proposed pipeline criteria

I am proposing that we remove all pipelines which do not meet any of the following criteria:

  1. Necessary for the ongoing health of the E2E test suite itself
  2. Used by a product team as a quality gateway. Concretely, this means that a failing test blocks a PR from being merged or a build artifact from being produced.
  3. Exist to ensure the quality of a supported product.

Timeline

  1. All existing E2E pipelines have stakeholders assigned no later than: October 1, 2022
  2. All stakeholders agree upon the proposed pipeline criteria no later than: October 20, 2022
  3. Non-conforming pipelines will be removed from Jenkins and their code will be removed from the E2E test suite beginning on: Nov 1st, 2022

Related efforts

There is a separate effort to try and reduce the scope of E2E testing back to a point where stability can be maintained, but it is limited to tests for the Agent. That effort can be found here: https://github.com/elastic/elastic-agent/issues/1174

cachedout avatar Sep 28 '22 09:09 cachedout

For the MacOS Daily pipeline: it was originally implemented in https://github.com/elastic/e2e-testing/pull/2626 using the Orka ephemeral workers, and it superseded https://github.com/elastic/e2e-testing/pull/2336.

The error is something the @elastic/ci-systems team might need to help with:

```
[2022-09-28T04:58:54.763Z] + .ci/scripts/deployment.sh create
[2022-09-28T04:58:54.887Z] Cloning into '.obs'...
[2022-09-28T04:58:55.095Z] Host key verification failed.
[2022-09-28T04:58:55.095Z] fatal: Could not read from remote repository.
[2022-09-28T04:58:55.095Z]
[2022-09-28T04:58:55.095Z] Please make sure you have the correct access rights
[2022-09-28T04:58:55.095Z] and the repository exists.
```

IIUC, the recent upgrade to the CI controllers added host key verification by default. We reported this in the past and it was partially fixed, since we no longer see the error below but a new one:

(screenshot of the new error)

but the error now happens in a subsequent stage to clone a private repository -- see the above console log
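A common remediation for this class of failure is to pre-populate the worker's `known_hosts` before the clone step, so that SSH host key verification succeeds non-interactively. A minimal sketch follows; the repository name and paths are assumptions for illustration, not taken from the actual pipeline:

```shell
# Add the Git host's public keys to known_hosts ahead of the clone,
# so strict host key checking passes without a prompt.
mkdir -p ~/.ssh
ssh-keyscan -t rsa,ecdsa,ed25519 github.com >> ~/.ssh/known_hosts

# Alternatively, scope the setting to git itself (OpenSSH >= 7.6):
# accept-new trusts a host on first contact but still rejects changed keys.
export GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=accept-new'

# Hypothetical clone of the private repo into .obs, as the failing stage does.
git clone git@github.com:elastic/some-private-repo.git .obs
```

Either approach is a workaround at the job level; the underlying fix would be for the CI controller images to ship with the relevant host keys baked in.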

It worked in the past

(screenshot of a previously passing build)

v1v avatar Sep 28 '22 10:09 v1v

The Docker images pipeline generated the systemd Docker images used in the E2E tests; we are probably the stakeholders.

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@v1v Thanks, that helps. I'm also trying to figure out what it actually does so that I can figure out who the stakeholders should be. I'm code-diving a bit right now to try to get a sense of that.

cachedout avatar Sep 28 '22 10:09 cachedout

Observability Helm Charts can be removed

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@kuisathaverat Thanks! Regarding the Docker images -- that pipeline hasn't been executed for over a year. Does it still need to exist?

cachedout avatar Sep 28 '22 10:09 cachedout

> Does it still need to exist?

It is the only way to generate those images, so it should be executed when they change. These images are used to test installation in a systemd environment. The main changes they can have are bumping the systemd version or the Linux version.
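For context, exercising a systemd-based installation test typically means running the container with an init process as PID 1 and cgroup access. The sketch below is hypothetical (image name, flags, and service are assumptions, not the pipeline's actual commands):

```shell
# Build the systemd-capable test image from a hypothetical Dockerfile.
docker build -t e2e-systemd-test .

# systemd must run as PID 1 and needs the host's cgroup hierarchy
# (Docker 20.10+ for --cgroupns).
docker run -d --name systemd-test \
  --privileged \
  --cgroupns=host \
  -v /sys/fs/cgroup:/sys/fs/cgroup:rw \
  e2e-systemd-test /sbin/init

# Then drive the package installation test inside the container.
docker exec systemd-test systemctl is-system-running --wait
```

This is why the images only need regenerating when the systemd or Linux version is bumped: the test logic lives in the suite, not in the image.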

kuisathaverat avatar Sep 28 '22 10:09 kuisathaverat

@cmacknz and @jlind23 Are you tracking any issues for the flakiness in the K8s Autodiscover pipeline?

cachedout avatar Sep 28 '22 11:09 cachedout

> @v1v Thanks, that helps. I'm also trying to figure out what it actually does so that I can figure out who the stakeholders should be. I'm code-diving a bit right now to try to get a sense of that.

There was an original request to test on MacOS. It was initially attempted with AWS MacOS instances, but that approach was declined for various reasons:

  1. Cost: IIRC, machines would be created and paid for a 24-hour minimum, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1111883732
  2. Implementation: the Ansible EC2 integration didn't work well, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1118315327
  3. Ephemeral Orka workers became available, see https://github.com/elastic/e2e-testing/pull/2336#issuecomment-1147931612

I guess the stakeholder might be @jlind23, as he was the original requester for MacOS testing in AWS.

v1v avatar Sep 28 '22 11:09 v1v

@cachedout this is the issue we will use for the first half of 8.6. @AndersonQ is already assigned to it and will work closely with you in order to get back to a better place.

jlind23 avatar Sep 28 '22 11:09 jlind23

@jlind23 That link seems wrong? :)

cachedout avatar Sep 28 '22 11:09 cachedout

Sorry, this one - https://github.com/elastic/elastic-agent/issues/1174

jlind23 avatar Sep 28 '22 12:09 jlind23

> @cmacknz and @jlind23 Are you tracking any issues for the flakiness in the K8s Autodiscover pipeline?

No, it may make sense to follow up with the Observability Cloudnative monitoring team to see if they have interest in fixing these tests faster than the agent team can get to them. They have done the majority of the recent work for autodiscovery features in agent.

cmacknz avatar Sep 28 '22 18:09 cmacknz

> No, it may make sense to follow up with the Observability Cloudnative monitoring team to see if they have interest in fixing these tests faster than the agent team can get to them.

Looping in @gizas. We are trying to stabilize the E2E test suite. Are you aware of the flakiness in the K8s autodiscover tests, and if so, is anybody on your team investigating them?

cachedout avatar Sep 29 '22 11:09 cachedout

I have disabled most of the tests in the Fleet E2E suite while we evaluate what to do with the remaining ones: https://github.com/elastic/elastic-agent/issues/1174#issuecomment-1267023078

joshdover avatar Oct 04 '22 13:10 joshdover

Sorry for the delayed answer, @cachedout, @cmacknz. I'm just checking the K8s Autodiscover pipeline now. Can you point me to a failing run to have a look?

Indeed, we have provided some fixes in the past.

gizas avatar Oct 05 '22 11:10 gizas

> Observability Helm Charts can be removed

Any reason this hasn't been done yet? I've seen it fail on a few PR runs recently and couldn't find the issue tracking its removal.

joshdover avatar Oct 17 '22 11:10 joshdover

> Any reason this hasn't been done yet?

Hi @joshdover . The issue is this one: https://github.com/elastic/observability-robots/issues/1325

We were considering this blocked until they sorted out the public communication regarding chart deprecation, but TBH it's probably not a big deal to just pull it out now if it's failing in PRs. LMK what you think.

cachedout avatar Oct 17 '22 12:10 cachedout

Makes sense. I've only seen it fail once recently, but will flag it if it's more of a problem.


joshdover avatar Oct 17 '22 12:10 joshdover