Monitor builds on our private instances (trusted.ci.jenkins.io / infra.ci.jenkins.io / release.ci.jenkins.io)

Open dduportal opened this issue 3 years ago • 23 comments

The (private) Jenkins controller trusted.ci.jenkins.io has some jobs that regularly execute important tasks on some of the Jenkins or Jenkins Infra repositories, such as:

  • https://github.com/jenkins-infra/repository-permissions-updater to ensure Artifactory permissions are up to date with the GitHub maintainer permissions for each plugin or component
  • https://github.com/jenkins-infra/update-center2 to ensure update-center.json from https://updates.jenkins.io is up to date with the latest released plugins
  • https://github.com/jenkins-infra/jenkins.io and https://github.com/jenkins-infra/cn.jenkins.io to ensure these websites are up to date
  • Docker containers, reports, etc.

There have been numerous times when these builds failed to execute, or failed to even be scheduled, which led to outdated artefacts or websites, slowing down (or blocking) users.

We need monitoring of these important jobs so that the team is alerted quickly enough and can notify users proactively, as with any major production incident.

The challenge is that this Jenkins controller is a private one, so we need to control what information is exported (e.g. "simple notifications or GitHub status checks" are risky. Read #2834 if you do not agree :) ).

dduportal avatar Mar 22 '22 12:03 dduportal

Proposal by @daniel-beck after we asked him for help and advice:

If you do not want to provide credentials to monitoring, create a new job on trusted.ci that periodically publishes a JSON file to reports.jenkins.io with information about the other jobs on that instance.

Monitoring:

  • Check the file timestamp. If it is older than 2x the build interval, the watchdog is dead (if you can't check it, put the current UTC time into the file as a field during generation and parse that).
  • Check the contents for each of the jobs, in whatever format you want to provide information on them.

That's the 5 minute hack solution; there are probably better ones depending on what monitoring tools we use and what they can do. Here I assume Last-Modified support (if possible) and JSON parsing.
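The two checks above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names `generated_at` and `jobs` are hypothetical (no report format has been agreed on in this thread), and the fallback variant is used where the generation time is stored inside the file instead of relying on Last-Modified.

```python
import json
from datetime import datetime, timedelta, timezone


def check_watchdog_report(raw_json, build_interval, now=None):
    """Return a list of human-readable problems found in the report."""
    now = now or datetime.now(timezone.utc)
    report = json.loads(raw_json)
    problems = []

    # "generated_at" is a hypothetical field: UTC generation time, ISO 8601.
    generated_at = datetime.fromisoformat(report["generated_at"])
    if now - generated_at > 2 * build_interval:
        problems.append("watchdog is dead: report is stale")

    # "jobs" is a hypothetical mapping of job name to last build result.
    for name, result in sorted(report["jobs"].items()):
        if result != "SUCCESS":
            problems.append(f"{name}: last build was {result}")
    return problems
```

An empty list means the watchdog job is alive and every reported job succeeded; anything else is what the monitoring tool would alert on.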

dduportal avatar Mar 22 '22 12:03 dduportal

Another idea (not mutually exclusive): a notification on an IRC channel with the controller hostname, job name and status. It would at least cover build failures (but not the "unable to schedule builds" case).

dduportal avatar Mar 22 '22 13:03 dduportal

Would it be any easier or more portable to replicate the RSS feeds from trusted.ci.jenkins.io to a publicly visible location?

I've been using the RSS feeds from specific jobs on ci.jenkins.io as a low cost monitoring system for jobs that I select. It is visible as a small icon on my Google Chrome browser.

I use https://feeder.co/ to monitor job failure RSS feeds like https://ci.jenkins.io/job/Infra/job/acceptance-tests/job/check-agent-availability/rssFailed . That shows a small number on my Google Chrome web browser bar when there is a failure. When I have time and notice the failure count, I click the RSS feed and it opens the page with the failure.

I think this may still be more complicated than Daniel's idea of a job on trusted.ci.jenkins.io that exports failures to a public location.

As an angle on Daniel's idea, I have a separate Python script that I use today with my Jenkins test instance to report if a job associated with a resolved Jenkins Jira issue is failing. I may try an experiment to convert that script into a Jenkins job that might be reusable as the type of "inside Jenkins" job monitor that Daniel has described.

MarkEWaite avatar Mar 29 '22 17:03 MarkEWaite

Would it be any easier or more portable to replicate the RSS feeds from trusted.ci.jenkins.io to a publicly visible location?

An additional translation layer we control would be useful IMO. E.g. renaming a job on trusted CI should not break monitoring. Exposing history beyond the latest build is likely also unnecessary.
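As a rough illustration of such a translation layer, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the mapping, the job paths and the shape of `latest_builds` are hypothetical and do not describe how trusted.ci.jenkins.io is actually configured.

```python
def build_public_report(latest_builds, monitored_jobs):
    """Translate internal job results into a report keyed by stable public names.

    latest_builds: hypothetical dict of internal job path -> latest build info.
    monitored_jobs: hypothetical dict of stable public key -> internal job path.
    """
    report = {}
    for public_key, job_path in monitored_jobs.items():
        build = latest_builds.get(job_path)
        # Only the latest result is exposed: no history, no internal job path.
        report[public_key] = build["result"] if build else "UNKNOWN"
    return report
```

If a job is renamed on the controller, only the `monitored_jobs` mapping changes; the public key that monitoring watches stays the same.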

daniel-beck avatar Mar 29 '22 17:03 daniel-beck

@daniel-beck I opened https://github.com/jenkins-infra/infra-reports/pull/62, could you give me your opinion about it when you have some time please?

lemeurherve avatar Mar 01 '24 19:03 lemeurherve

@daniel-beck I opened jenkins-infra/infra-reports#62, could you give me your opinion about it when you have some time please?

This proposal, despite being technically interesting, has the big downside of requiring a privileged credential to reach each controller, which is unsafe.

dduportal avatar Dec 03 '24 14:12 dduportal

Let's go with the "boring but efficient" proposal:

  • We create a new function in our shared pipeline library which publishes status to a private reports.jenkins.io website
    • Using textual or JSON data with the controller hostname, the job name, the build status and the timestamp
  • Then, we monitor these JSON files in Datadog to alert us when:
    • The status is not "SUCCESS"
    • Or the timestamp is older than a certain amount of time (configurable per job, because we don't run them all at the same frequency)
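To make the alerting rule concrete, here is a small Python sketch of the decision logic. It is not a Datadog monitor definition and not the shared library function itself; the record fields mirror the four values listed above, while the job labels and thresholds in `MAX_AGE_BY_JOB` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical per-job staleness thresholds; jobs run at different frequencies.
MAX_AGE_BY_JOB = {
    "update-center": timedelta(minutes=30),
    "jenkins.io": timedelta(hours=2),
}
DEFAULT_MAX_AGE = timedelta(hours=24)


@dataclass
class BuildStatus:
    controller: str
    job: str
    status: str
    timestamp: datetime  # timezone-aware, UTC


def alert_reason(record, now=None):
    """Return why this record should alert, or None if it is healthy."""
    now = now or datetime.now(timezone.utc)
    if record.status != "SUCCESS":
        return f"{record.controller}/{record.job}: status is {record.status}"
    max_age = MAX_AGE_BY_JOB.get(record.job, DEFAULT_MAX_AGE)
    if now - record.timestamp > max_age:
        return f"{record.controller}/{record.job}: no report for more than {max_age}"
    return None
```

The staleness check is what catches the "failing to even be scheduled" case: a job that never runs never publishes a new record, so its last timestamp eventually exceeds the threshold.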

This proposal:

  • Does not require any kind of credential on the sensitive controller
  • Only exports a few pieces of information, which are either already public (job names and controller names), easy to get, or non-sensitive
  • Does not require any direct access from the monitoring to the controller, nor does it require a plugin or custom routine on the VMs/pods

dduportal avatar Dec 03 '24 14:12 dduportal

Ref

  • https://github.com/jenkins-infra/helpdesk/issues/1971
  • https://github.com/jenkins-infra/helpdesk/issues/1970
  • https://github.com/jenkins-infra/helpdesk/issues/4383

dduportal avatar Dec 03 '24 14:12 dduportal

Delaying this issue in favor of https://github.com/jenkins-infra/helpdesk/issues/4539 so that @jayfranco999 can get a first real-life experience with the Groovy pipeline libraries.

dduportal avatar Feb 19 '25 14:02 dduportal