Really need a build pipeline "stalled" detector?
A common problem is that nightly/release builds get stalled because a vital node (e.g. an installer node) goes offline, and everything stalls until someone notices... Similarly, sometimes the required set of build nodes are all offline, or the agent has failed...
Having a task that detects this and Alerts to Slack would be useful?
Related https://github.com/AdoptOpenJDK/openjdk-build/issues/1101
I introduced a rough timeout a while back as part of the above issue that will abort the entire pipeline after 18 hours. This doesn't include the process of searching for a node label, however (that is defined in build_base_file.groovy, which calls the file the timeouts were introduced in): https://github.com/AdoptOpenJDK/openjdk-build/blob/773c2ebc6102a1d09d194c19eb236edda7cb7e50/pipelines/build/common/build_base_file.groovy#L425
What I propose is an overhaul of the current timeout system to set specified timeouts for each "stage" of the build. As an example, the bit where the job searches for a downstream node could be 4 hours, the bit where it runs the actual build could be 14, etc. This way, if a job gets stalled due to a node being pulled offline for a long time, it will crash and won't sit there until someone notices.
What do you think @andrew-m-leonard? We could try revamping how long Jenkins searches for nodes if that idea doesn't work?
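For illustration, here's a rough sketch (not the real `build_base_file.groovy` code) of what per-stage budgets could look like in a scripted pipeline. The label, durations and script name are placeholders, and `nodesByLabel` comes from the Pipeline Utility Steps plugin, if I remember right:

```groovy
// Sketch only: give each phase its own budget instead of one 18h limit.
def label = 'build&&linux&&x64'

stage('Find a build node') {
    // Give up after 4 hours if no online node carries the label at all.
    timeout(time: 4, unit: 'HOURS') {
        waitUntil {
            // nodesByLabel lists the online nodes matching a label
            !nodesByLabel(label: label).isEmpty()
        }
    }
}

node(label) {
    stage('Build') {
        timeout(time: 14, unit: 'HOURS') {
            sh './makejdk-any-platform.sh ...'   // placeholder build step
        }
    }
    stage('Sign') {
        timeout(time: 2, unit: 'HOURS') {
            echo 'signing would happen here'     // placeholder
        }
    }
}
```

This still queues on `node(label)` if the machines exist but are busy, so the existing overall timeout would probably stay as a safety net.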
Hi Morgan, yes I remember you looking at the timeout options. I'm wondering if we can be even cleverer; I've got some ideas:
- Monitor "Consoles" for known hang situations,eg. "No nodes of label: build&linux&&x64", and perform a query of that label to check they are all offline?
- Monitor "Consoles" for stalling,eg.no output within a certain duration, with no obvious message (eg.waiting for node). Period could be quite short, say 10mins with no output? then Alert to Slack. We could "tune" this and make it more clever if it over/under Alerts...?
By the way, I've no idea how you would "grab" the Console!! I'm hoping there's a Jenkins API?
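For what it's worth, Jenkins does serve each build's console as plain text at `<build-url>/consoleText`, so a monitoring script can simply fetch it over HTTP. A rough sketch in Groovy; the build URL and the user:apitoken pair are made up for illustration:

```groovy
// Sketch: fetch a build's console over the Jenkins REST interface.
def buildUrl = 'https://ci.adoptopenjdk.net/job/some-folder/job/some-build-job/123/'
def auth     = 'user:apitoken'.bytes.encodeBase64().toString()

def conn = new URL(buildUrl + 'consoleText').openConnection()
conn.setRequestProperty('Authorization', "Basic ${auth}")
String consoleLog = conn.inputStream.getText('UTF-8')

// Example: look for a known "waiting for a node" hang message
if (consoleLog.contains('Still waiting to schedule task')) {
    println 'Job appears to be stuck waiting for a node'
}
```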
OK, so my current thinking is we do two things:
- Overhaul the timeout system so that nodes that go down are noticed sooner by Jenkins. This will also be a good opportunity to revisit this topic, which was pushed in a little hastily a few months ago. If a node does go down and the pipeline gets jammed, we move on to step 2.
- When a timeout is reached, we use this kind soul's answer on StackOverflow to pull in the console output and search through it for the reason why it's stalled (e.g. `Node offline`, `Querying Adopt API...`, etc.).
  - If it's a node issue, we would utilise the Jenkins API to search for online nodes. So, for example, if it's a `jdk-windows-x64-hotspot` job, we would iterate through the `nodes` attribute in https://ci.adoptopenjdk.net/label/win2012&&vs2017&&build&&windows&&x64/api/json?pretty=true (the bit that starts with win2012 is the node label that is created and is what Jenkins uses to find a corresponding node for the job in question). There's a rough sketch of this query at the end of this comment.
    - If we don't find a valid one, we fail the build because there are no valid nodes on Jenkins that can run the job.
    - If we do find a valid one, we run a final query against the `nodeName` itself to check whether it's online: https://ci.adoptopenjdk.net/computer/build-softlayer-win2012r2-x64-2/api/json?pretty=true. We'll spit the output of the iteration into the console log so people can quickly see how many nodes are online or not.
  - If it isn't a node issue, we'll just fail the build with a generic message like `Timeout limit exceeded. See above for error` yada yada yada.
- Once that's done, we fail the build with the relevant reason. Automated messages to Slack have always been a touchy subject (see https://github.com/AdoptOpenJDK/openjdk-build/issues/1504) so I would recommend opening a new Slack channel for these notifications so that we don't flood #build or #infrastructure with spam.
Note: this is dependent on https://github.com/AdoptOpenJDK/openjdk-jenkins-helper/issues/38 being accepted.
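A rough sketch of the node query described above, using the two REST endpoints already linked (the label string is the example one from the comment; authentication and error handling are left out):

```groovy
// Sketch: check whether any node carrying a label is actually online, using
// the /label/<label>/api/json and /computer/<node>/api/json endpoints.
import groovy.json.JsonSlurper

def fetchJson = { String url -> new JsonSlurper().parse(new URL(url)) }

def label     = 'win2012&&vs2017&&build&&windows&&x64'
def labelInfo = fetchJson("https://ci.adoptopenjdk.net/label/${label}/api/json")

def onlineNodes = labelInfo.nodes.findAll { n ->
    def computer = fetchJson("https://ci.adoptopenjdk.net/computer/${n.nodeName}/api/json")
    println "${n.nodeName}: ${computer.offline ? 'OFFLINE' : 'online'}"
    !computer.offline
}

if (onlineNodes.isEmpty()) {
    println "No online nodes match '${label}' - nothing on Jenkins can run this job"
}
```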
What sort of timeout are we looking at in (1)?
Another thought I've had is a simple one: there is a certain set of node "label" allocations that always need to be available, e.g. installer nodes, the build node sets for each platform, the test node sets for each platform. We could have a job that runs, say, every 30 mins and checks that each "set" has at least one node available. If not, it "marks" that set as a possible problem; if that set is still unavailable when the job runs again 30 mins later, a Slack alert is sent, and if the "set" has recovered, the "mark" is cleared.
> What sort of timeout are we looking at in (1)?
Depends on each stage of the build. As I said above, the current timeout is 18 hours for the whole build (including downstream installers and tests). I want to rejig it so we have set timeouts for each stage: signing will have one timeout, node finding another, tests another, and so on. Each one will have a different length depending on how long each stage is likely to take.
> Another thought I've had is a simple one: there is a certain set of node "label" allocations that always need to be available... We could have a job that runs, say, every 30 mins and checks that each "set" has at least one node available.
The label sets are fairly hard to understand at a quick glance (and there isn't currently a way to match a set to a specific job), so it would be pretty difficult to work out which nodes are affected from a Slack alert message. What's more, each version, OS, arch and variant requires a unique label set, so it would be hard to match a job to a node, if you see what I'm saying. I'm open to discussion, however, as the job idea is a useful one (having everything tied up in one place), but I feel we could merge this into the main build scripts fairly easily without having to create a completely new job.
So there are 2 scenarios I'm thinking of addressing:
- A "node set" is offline, preventing anything that wants that "set" from running...
- A job (or jobs) has "hung", blocking new jobs
Scenario 1: I'm thinking of more immediate stall detection here; 18 hours is too long to wait. For example:
- Take the "create_installer_mac" job, if you look at the Configuration for the job it has to run on a node matching label: "mac&&macos10.14&&xcode10". This would be a node set that has to have at least 1 node online, if the "detection job" detects that there are none after 30mins it would Alert to Slack
- Take "create_installer_windows", same again if "windows&&wix" has none online after 30mins...
- Then for each build platform, eg.pLinux: "build&&linux&&ppc64le", jdk15 Mac: "macos10.14&&build&&mac&&x64". These strings can be dynamically calculated from the same config as the job re-gen script that reads: https://github.com/AdoptOpenJDK/openjdk-build/tree/813df912749918b2bece5a3d7b8b91b8f637d5d4/pipelines/jobs/configurations
I'd rather have a job that Alerts too often to start with, and then we can "fine tune" it to make it better...?
Scenario 2: This is where a "no Console Output after a certain duration" check would help. If a job has not output anything for 30 mins, an Alert would fire...
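For the in-pipeline half of this, the standard `timeout` step already has an `activity` flag that aborts the block when there has been no log output for the given period, which maps fairly closely onto this idea. A minimal sketch; the label and command are placeholders:

```groovy
node('build&&linux&&x64') {
    // Aborts if this block produces no console output for 30 minutes,
    // regardless of how long it runs overall while it keeps printing.
    timeout(time: 30, unit: 'MINUTES', activity: true) {
        sh './makejdk-any-platform.sh ...'   // placeholder build step
    }
}
```

It would not catch the "Still waiting to schedule task" case, since that happens before the node block is entered; that is really scenario 1.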
I believe there is a Jenkins plugin that will accomplish what you are describing (so you do not end up having to write/own/maintain a custom script). https://plugins.jenkins.io/build-failure-analyzer/
@AdamBrousseau can likely answer usage questions you might have...
Note there is also a https://plugins.jenkins.io/log-parser/ plugin if that is of any interest.
@smlambert the log-parser might be useful. The build-failure-analyzer only works on "completed" builds; we're aiming to detect still-running jobs that are "hung" or waiting for a set of resources that are currently permanently unavailable...
@smlambert as it so happens, last night's jdk11 Nightly is hung due to this issue I have just raised: https://github.com/AdoptOpenJDK/openjdk-tests/issues/1980. Until someone happens to notice it and kills the job, it blocks Nightly runs... it has been known to take several days!!
Seems last night's builds are a great example; another hang, this time for create_installer_linux: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1575
and another! https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1576
This also highlights how woefully short of machines we are.... nightly builds stalled due to single machines being offline!!
@andrew-m-leonard We need someone to figure out what we need. Could you take the lead, ask around and compile a list?
Best would be something like:
Present:
x build workers x86 (y cores, w GB RAM, z GB disk, virtualized/cloud)
x test workers x86 (y cores, w GB RAM, z GB disk, bare metal)
Required:
x build workers ppc64le (y cores, w GB RAM, z GB disk, virtualized/cloud)
x test workers ppc64le (y cores, w GB RAM, z GB disk, bare metal)
And not only short-term requirements, but for the next year at least, too.
We have some other problems, like a single machine for Linux package builds. But that's not due to machine shortage.
@aahlenst yes, good plan, I'll do that. @sxa @Haroon-Khel @Willsparker do we have any current "resource lists" anywhere?
https://github.com/AdoptOpenJDK/openjdk-build/issues/1044 is the issue where I've logged known "single points of failure"
re: https://github.com/AdoptOpenJDK/openjdk-build/issues/2093#issuecomment-700566178 I am surprised it can be 'days', as the test pipelines will time out and be aborted after 10 hrs, unless someone is changing the timeout parameter to be longer than the default (10 hrs).
Does each job that is launched by the top-level pipeline have its own timeout set? In particular the jobs that continue to hang, like the create_linux_installer job mentioned in https://github.com/AdoptOpenJDK/openjdk-build/issues/2093#issuecomment-700574554.
Each step in the pipeline needs to live up to certain requirements: they clean up after themselves, they abort themselves after a certain timeout factor, and they report their input/output metadata in the console output, as outlined in https://github.com/AdoptOpenJDK/TSC/issues/158#issuecomment-653664841.
This issue is coming at it from the top down, with the parent pipeline monitoring all the children. Each step (child job) should be reviewed to ensure it meets the requirements of a well-designed, autonomous pipeline, so there is less work/monitoring/code required in the parent.
> re: #2093 (comment) I am surprised it can be 'days', as the test pipelines will time out and be aborted after 10 hrs, unless someone is changing the timeout parameter to be longer than the default (10 hrs).

You're correct, but from my experience I have seen backlogs of over a day before.
Adding to the comment about well-behaved jobs, I would like all jobs to output periodic progress information so that we can tell as soon as possible that a given job really has hung, say within 15 mins of no progress output. The exception to that would be the message "Still waiting to schedule task", i.e. "Queue" status.
I might even suggest going as far as having a "dead job monkey" that goes around finding "Nightly" jobs that have "hung" and Aborting them automatically...?
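A very rough sketch of what such a "dead job monkey" could look like, meant to run as a scheduled system-Groovy script with admin rights. The 30-minute threshold is illustrative, and using `Run.getLogFile()` is a crude (and deprecated) way of measuring "last output":

```groovy
import hudson.model.Job
import hudson.model.Result
import jenkins.model.Jenkins

long maxQuietMs = 30 * 60 * 1000L

Jenkins.get().getAllItems(Job).each { job ->
    job.builds.findAll { it.isBuilding() }.each { build ->
        // Crude "last output" signal: the log file's modification time
        long quietMs = System.currentTimeMillis() - build.logFile.lastModified()
        if (quietMs > maxQuietMs) {
            println "Aborting ${build.fullDisplayName}: quiet for ${quietMs.intdiv(60000)} minutes"
            build.executor?.interrupt(Result.ABORTED)
        }
    }
}
```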
@andrew-m-leonard By nightly jobs, I assume you are referring to the top-level jobs AS WELL as their dominions (e.g. signing, installers, tests, etc.). If that's the case, I can see how having a completely separate job may be beneficial, since it would be both easier to search for downstream jobs AND to add new jobs that are outside the pipeline but still need to be detected if they get stuck (sxa's process check, for example).
However, if the scope of this issue is just to check whether the top-level jobs have failed, then I don't feel it would be necessary to create a whole new job. We could just integrate this functionality into the pipelines themselves.
@M-Davies so we could add it into the top-level pipeline logic... However, what if someone starts/rebuilds a build or test job and that hangs...? It would still have the undesired effect... I would like to put forward the idea that we have an Adopt CI "job policy" whereby any job that appears "hung" after a duration of, say, 30 mins is automatically Aborted...? i.e. a "fail fast" analogy.
I think, as a separate thing, we may also want a "Nodes Offline Detector" job as well, which detects, for example, that "ALL zLinux nodes are offline", "ALL Windows Installer nodes are offline", ...?
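A rough sketch of that "Nodes Offline Detector" as a periodic system-Groovy job; the label sets and the Slack call are placeholders (we'd feed the real sets in from the pipeline configurations):

```groovy
import jenkins.model.Jenkins

def labelSets = [
    'build&&linux&&s390x',       // e.g. zLinux build nodes
    'windows&&wix',              // Windows installer nodes
    'mac&&macos10.14&&xcode10',  // Mac installer nodes
]

labelSets.each { expr ->
    def online = Jenkins.get().getLabel(expr).nodes.findAll { node ->
        node.toComputer()?.isOnline()
    }
    if (online.isEmpty()) {
        println "ALERT: no online nodes for label set '${expr}'"   // Slack alert would go here
    } else {
        println "OK: ${online.size()} node(s) online for '${expr}'"
    }
}
```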
Maybe not 30 mins, as some of the test jobs can take up to 6 hrs! But yeah, I'd agree that we need a system outside of the pipeline logic if the scope is to cover all the downstream jobs as well. I rather like the idea of having one job for detecting hung jobs and one for detecting whether all the nodes of a specific OS, arch or node label combination are offline.
Agreed. Perhaps we should even go a step further and have a graph of jobs for all the pipelines, along with acceptable time ranges for each node. An external job could run hourly, consult the latest run of each job/graph-node, identify jobs which have exceeded the acceptable range of times (too long or too short), and perform the appropriate action (restart/notify/cleanup/combo).
Notes:
- "Acceptable time ranges" could be as simple as "The range of times in the last 3 successful executions, =/- 50%"
- We'd need to exclude job numbers which have already been checked, to prevent us reporting the same job many times.
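A rough sketch of the "acceptable time range" check; only the "too long" side is shown, and the job name and the 50% factor are illustrative:

```groovy
import hudson.model.Job
import hudson.model.Result
import jenkins.model.Jenkins

def job = Jenkins.get().getItemByFullName('build-scripts/openjdk11-pipeline', Job)

// Durations (ms) of the last three successful runs, newest first
def recent = job.builds.findAll { it.result == Result.SUCCESS }.take(3)*.duration

def current = job.lastBuild
if (recent && current?.isBuilding()) {
    long maxAllowed = (recent.max() * 1.5) as long
    long elapsed    = System.currentTimeMillis() - current.startTimeInMillis
    if (elapsed > maxAllowed) {
        println "ALERT: ${current.fullDisplayName} has run for ${elapsed.intdiv(60000)}min," +
                " more than 150% of its recent maximum (${maxAllowed.intdiv(60000)}min)"
    }
}
```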
I wonder how much the existing pipeline plugins can help here - we know they've been buggy in places but it might be worth upstreaming fixes there
IMHO, if all jobs adhere to the same design principles, there is less need, or no need, for separate jobs to manage other jobs. Adding more jobs, rather than making the existing ones more self-managing, increases technical debt (jobs to manage jobs that manage other jobs... my devops nightmare). All self-managing jobs should behave similarly and follow the same set of design principles, as that will make them easier to manage/maintain and easier for contributors to help across components.
Design Principle 1: Self-managing jobs manage their own timeouts/resources (distributed versus centralized responsibility)
- Timeouts: each job maintains its own timeouts. Based on the expected execution time, with a good amount of padding to account for machine variations etc., every job has a timeout after which, if it has not completed, it aborts itself.
- Nodes: every job checks whether there is a node with the set of labels it needs to run on; if none exist (and are online), the job should exit with a useful message. If there is a node but it's busy, wait in a queue for XX acceptable minutes/hours.
- Labels: standard label schemas should be used, to avoid having to wade through code to find out what labels/nodes a job needs in order to run. Required labels are part of the input information to print to the console (see design principle 2).
Design Principle 2: Report all inputs and outputs (related: https://github.com/AdoptOpenJDK/TSC/issues/158). Every job prints where it takes its material from (repo / branch / SHAs), the commands used, the environment variables set, what logs are created/archived, and what end product is produced and where it is pushed. We can use the InToto metadata template for this information so that all jobs print it in a standardized way, which will make further automation easier.
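Not the InToto template itself, but as a trivial illustration of the principle, every job could start by echoing a standard block of its inputs; the field names here are just examples:

```groovy
node('build&&linux&&x64') {
    // checkout returns a map of SCM details (GIT_URL, GIT_BRANCH, GIT_COMMIT, ...)
    def scmVars = checkout scm

    def inputs = [
        repository: scmVars.GIT_URL,
        branch    : scmVars.GIT_BRANCH,
        commit    : scmVars.GIT_COMMIT,
        node      : env.NODE_NAME,
        labels    : env.NODE_LABELS,
    ]
    echo '=== JOB INPUTS ===\n' + inputs.collect { k, v -> "${k}: ${v}" }.join('\n')
}
```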
Design Principle 3: Avoid technical-debt decisions. For each new feature or enhancement, look to see whether there is already a well-maintained open-source option that suits the purpose. If so, use it rather than rolling your own solution. If upstream solutions do not quite meet all needs but are close, consider contributing to them; enabling their use is preferable to writing/maintaining a separate code base.
It is great to have this issue as a catalyst for a design discussion. I will host an open meeting about the current design and future direction of the build pipelines, tentatively Tuesday, October 6th (8am EDT). I will post details in the Slack #build channel.
I'm not a great fan of timeouts, but there definitely needs to be a set of timeouts for the long term maximum expected duration of a job or stage. When no online nodes are available, I would like some sort of immediate Alert, so someone goes and fixes the problem...
I'd like us to take an Agile, small-steps approach to improvement here.
The problem with the lack of nodes is one for our monitoring to solve. Is there a way to get the queued jobs and the online nodes out of Jenkins, both with their labels? We could then figure out whether there's at least one node (warning) or no node (error). I'm happy to guide someone in writing such a check but don't have the time to do it myself.
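To the question above: both are exposed via the Jenkins REST API, `/queue/api/json` for queued items (including Jenkins' own "why" explanation) and `/computer/api/json` for agents with their labels and online state. A rough sketch, with authentication omitted and field names worth double-checking against our Jenkins version:

```groovy
import groovy.json.JsonSlurper

def base  = 'https://ci.adoptopenjdk.net'
def slurp = { String path -> new JsonSlurper().parse(new URL(base + path)) }

// Labels that currently have at least one online agent
def onlineLabels = [] as Set
slurp('/computer/api/json').computer.each { c ->
    if (!c.offline) {
        c.assignedLabels.each { onlineLabels << it.name }
    }
}

// Queued items, with Jenkins' explanation of why they are waiting
slurp('/queue/api/json').items.each { item ->
    println "QUEUED: ${item.task?.name} - ${item.why}"
}

println "Labels with at least one online node: ${onlineLabels.join(', ')}"
```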