
Simultaneous timed builds omit the first selected node from the executor list for the second scored build

Open Gitaffe opened this issue 10 months ago • 12 comments

Jenkins and plugins versions report

Environment
Jenkins: 2.440.2
OS: Linux - 5.14.0-362.8.1.el9_3.x86_64
Java: 21.0.2 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
PrioritySorter:5.1.0
ace-editor:1.1
adoptopenjdk:1.5
analysis-model-api:12.1.0
ansible:307.va_1f3ef06575a_
ansicolor:1.0.4
ant:497.v94e7d9fffa_b_9
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-1.0
asm-api:9.6-3.v2e1fa_b_338cd7
authentication-tokens:1.53.v1c90fd9191a_b_
aws-credentials:231.v08a_59f17d742
aws-java-sdk:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-api-gateway:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-autoscaling:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-cloudformation:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-cloudfront:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-codebuild:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-codedeploy:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-ec2:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-ecr:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-ecs:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-efs:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-elasticbeanstalk:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-elasticloadbalancingv2:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-iam:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-kinesis:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-lambda:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-logs:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-minimal:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-organizations:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-secretsmanager:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-sns:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-sqs:1.12.671-445.ve02f9b_558f2e
aws-java-sdk-ssm:1.12.671-445.ve02f9b_558f2e
badge:1.9.1
bootstrap4-api:4.6.0-6
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9
branch-api:2.1152.v6f101e97dd77
build-blocker-plugin:1.7.9
build-failure-analyzer:2.5.0
build-name-setter:2.4.2
build-timeout:1.32
build-timestamp:1.0.3
build-user-vars-plugin:1.9
build-with-parameters:76.v9382db_f78962
caffeine-api:3.1.8-133.v17b_1ff2e0599
changelog-history:1.7
checks-api:2.0.2
claim:543.v0061de6b_dcb_d
cloud-stats:336.v788e4055508b_
cloudbees-folder:6.901.vb_4c7a_da_75da_3
command-launcher:107.v773860566e2e
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-95.v22a_d30ee5d36
compact-columns:1.185.vf3851b_4d31fe
conditional-buildstep:1.4.3
config-file-provider:968.ve1ca_eb_913f8c
configurationslicing:548.ve92d48e66b_f8
copy-project-link:106.veb_028794a_844
copyartifact:722.v0662a_9b_e22a_c
credentials:1337.v60b_d7b_c7b_c9f
credentials-binding:657.v2b_19db_7d6e6d
custom-tools-plugin:0.8
dashboard-view:2.508.va_74654f026d1
data-tables-api:2.0.2-1
delivery-pipeline-plugin:1.4.2
dependency-check-jenkins-plugin:5.5.0
dependency-track:4.3.1
dependencyanalyzer:0.7
depgraph-view:1.0.5
description-setter:239.vd0a_6b_785f92d
disk-usage:1.2
display-url-api:2.200.vb_9327d658781
docker-commons:439.va_3cb_0a_6a_fb_29
docker-java-api:3.3.4-86.v39b_a_5ede342c
docker-plugin:1.6
docker-workflow:572.v950f58993843
durable-task:550.v0930093c4b_a_6
ec2:1648.vf3d852e00486
echarts-api:5.5.0-1
email-ext:2.105
email-ext-recipients-column:27.vb_9404db_b_018d
emailext-template:1.5
enhanced-old-build-discarder:1.4
envinject:2.908.v66a_774b_31d93
envinject-api:1.199.v3ce31253ed13
extended-choice-parameter:381.v360a_25ea_017c
extended-read-permission:53.v6499940139e5
external-monitor-job:215.v2e88e894db_f8
extra-columns:1.26
extra-tool-installers:139.v723fee51b_7f2
font-awesome-api:6.5.1-3
forensics-api:2.4.0
generic-webhook-trigger:2.1.0
git:5.2.1
git-client:4.7.0
git-parameter:0.9.19
git-server:114.v068a_c7cc2574
gitea:1.4.7
github:1.38.0
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1785.v99802b_69816c
groovy:457.v99900cb_85593
groovy-postbuild:228.vcdb_cf7265066
gson-api:2.10.1-15.v0d99f670e0a_7
h2-api:11.1.4.199-12.v9f4244395f7a_
handlebars:3.0.8
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
heavy-job:1.1
htmlpublisher:1.33
http_request:1.18
instance-identity:185.v303dc7c645f9
ionicons-api:56.v1b_1c8c49374e
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
job-dsl:1.87
job-node-stalker:1.0.5
job-restrictions:0.8
jobConfigHistory:1229.v3039470161a_d
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
join:1.21
jquery:1.12.4-1
jquery-detached:1.2.1
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
ldap:719.vcb_d039b_77d0d
lockable-resources:1246.v28b_e4cc6fa_16
log-parser:2.3.3
mailer:472.vf7c289a_4b_420
mapdb-api:1.0.9-40.v58107308b_7a_7
matrix-auth:3.2.2
matrix-project:822.824.v14451b_c0fd42
maven-plugin:3.23
metrics:4.2.21-449.v6960d7c54c69
mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd
mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd
momentjs:1.1.1
multiple-scms:0.8
naginator:1.436.vb_e769dcb_cdf6
next-executions:327.v136ff959e97b_
node-iterator-api:55.v3b_77d4032326
nodelabelparameter:1.12.0
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pam-auth:1.10
parameterized-scheduler:262.v00f3d90585cc
parameterized-trigger:787.v665fcf2a_830b_
percentage-du-node-column:0.1.0
periodic-reincarnation:1.13
pipeline-aggregator-view:104.v94a_e5f6cdb_c3
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github-lib:42.v0739460cda_c4
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:704.vc58b_8890a_384
pipeline-input-step:491.vb_07d21da_1a_fb_
pipeline-maven:1396.veb_f07b_2fc1d8
pipeline-maven-api:1396.veb_f07b_2fc1d8
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2184.v0b_358b_953e69
pipeline-model-definition:2.2184.v0b_358b_953e69
pipeline-model-extensions:2.2184.v0b_358b_953e69
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2184.v0b_358b_953e69
pipeline-stage-view:2.34
pipeline-utility-steps:2.16.2
plain-credentials:179.vc5cb_98f6db_38
plugin-util-api:4.1.0
popper-api:1.16.1-3
popper2-api:2.11.6-4
powershell:2.1
prism-api:1.29.0-13
promoted-builds:945.v597f5c6a_d3fd
publish-over:0.22
publish-over-ssh:1.25
purge-job-history:1.6
rebuild:330.v645b_7df10e2a_
redmine:0.21
resource-disposer:0.23
role-strategy:713.vb_3837801b_8cc
run-condition:1.7
saml:4.464.vea_cb_75d7f5e0
scm-api:689.v237b_6d3a_ef7f
scoring-load-balancer:81.vb_82d531ff049
script-security:1326.vdb_c154de8669
scriptler:348.v5d461e205da_a_
secondary-timestamper-plugin:1.1
simple-theme-plugin:176.v39740c03a_a_f5
sitemonitor:0.6
slack:684.v833089650554
slave-setup:1.16
snakeyaml-api:2.2-111.vc6598e30cc65
sonar:2.17.2
ssh:2.6.1
ssh-agent:346.vda_a_c4f2c8e50
ssh-credentials:334.v7732563deee1
ssh-slaves:2.948.vb_8050d697fec
ssh-steps:2.0.68.va_d21a_12a_6476
ssh2easy:1.6
sshd:3.322.v159e91f6a_550
structs:337.v1b_04ea_4df7c8
subversion:1256.vee91953217b_6
thinBackup:1.19
throttle-concurrents:2.14
timestamper:1.26
token-macro:400.v35420b_922dcb_
translation:1.16
trilead-api:2.142.v748523a_76693
unicorn:0.1.1
uno-choice:2.8.3
validating-string-parameter:183.v3748e79b_9737
variables-replace-plugin:1.0.0
variant:60.v7290fc0eb_b_cd
view-job-filters:369.ve0513a_a_f5524
warnings-ng:11.2.2
windows-slaves:1.8.1
workflow-aggregator:596.v8c21c963d92d
workflow-api:1291.v51fd2a_625da_7
workflow-basic-steps:1049.v257a_e6b_30fb_d
workflow-cps:3889.v937e0b_3412d3
workflow-durable-task-step:1331.vc8c2fed35334
workflow-job:1400.v7fd111b_ec82f
workflow-multibranch:773.vc4fe1378f1d5
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:657.v03b_e8115821b_
workflow-support:896.v175a_a_9c5b_78f
ws-cleanup:0.45
zapper:1.0.7
zentimestamp:4.2

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux (Fedora) but running in the ubi9 Docker container from the official DockerHub image

Reproduction steps

  1. In a multi-node Jenkins environment, set up several time-triggered jobs on a regular cron schedule, e.g. every minute (example schedules below), with matching (or at least overlapping) labels, where one of the nodes has a substantially higher score than the others and would be expected to be used for all executions
  2. Turn on logging for the scoring load balancer
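
For illustration, the kind of "Build periodically" schedules assumed here (hypothetical; any specs that fire in the same instant will do):

```text
# Job aaa, "Build periodically":
* * * * *
# Job bbb, "Build periodically":
* * * * *
```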

Expected Results

I expect all jobs to run on the highly-preferred node.

Actual Results

Whichever job is processed second NEVER has the node selected for the first job available to it.

For example, in the log below there are 3 agents: the built-in node (which appears with no label) and two other agents. We're heavily favouring one node for illustration purposes, but this happens at any scale of preferences.

When all the jobs have the same set of permissible labels we see the following. Note that lnx-aarch64-agent-23a does NOT appear for the second job, bbb, but it is available for the third and fourth jobs.


2024-03-30T17:35:00.826410015Z [2024-03-30 17:35:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@6bfb8cc[infrastructure/utils/heartbeat/aaa]:
2024-03-30T17:35:00.826410015Z   lnx-aarch64-agent-23a: 111120
2024-03-30T17:35:00.826410015Z                       : 4040
2024-03-30T17:35:00.826410015Z      lnx-x86-agent-24a:  220
2024-03-30T17:35:00.827597887Z [2024-03-30 17:35:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@3432ffd0[infrastructure/utils/heartbeat/bbb]:
2024-03-30T17:35:00.827597887Z                       : 4040
2024-03-30T17:35:00.827597887Z      lnx-x86-agent-24a:  220
2024-03-30T17:35:00.831500805Z [2024-03-30 17:35:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@4971380a[infrastructure/utils/heartbeat/ccc]:
2024-03-30T17:35:00.831500805Z   lnx-aarch64-agent-23a: 111040
2024-03-30T17:35:00.831500805Z                       : 3960
2024-03-30T17:35:00.831500805Z      lnx-x86-agent-24a:  220
2024-03-30T17:35:00.843127003Z [2024-03-30 17:35:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@67790f4a[db-status-checking]:
2024-03-30T17:35:00.843127003Z   lnx-aarch64-agent-23a: 111160
2024-03-30T17:35:00.843127003Z                       : 7560
2024-03-30T17:35:00.843127003Z      lnx-x86-agent-24a:  820

This seems consistent no matter which nodes we set as preferred, what the set of running agents is, and which labels are used.

The only variation I've noticed is when the first job has a different permissible label set (i.e. just the built-in controller): that node is not available to the second job, and the node the second job selects is then not available to the third job either, but they all come back in the fourth job's set.

2024-03-30T17:32:00.824043352Z [2024-03-30 17:32:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@6eb306c3[infrastructure/utils/built-in-only]:
2024-03-30T17:32:00.824043352Z                       :  440
2024-03-30T17:32:00.825264807Z [2024-03-30 17:32:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@6bfb8cc[infrastructure/utils/heartbeat/aaa]:
2024-03-30T17:32:00.825264807Z   lnx-aarch64-agent-23a: 111200
2024-03-30T17:32:00.825264807Z      lnx-x86-agent-24a:  220
2024-03-30T17:32:00.826496652Z [2024-03-30 17:32:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@3432ffd0[infrastructure/utils/heartbeat/bbb]:
2024-03-30T17:32:00.826496652Z      lnx-x86-agent-24a:  220
2024-03-30T17:32:00.828527696Z [2024-03-30 17:32:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@4971380a[infrastructure/utils/heartbeat/ccc]:
2024-03-30T17:32:00.828527696Z   lnx-aarch64-agent-23a: 111120
2024-03-30T17:32:00.828527696Z                       : 3960
2024-03-30T17:32:00.828527696Z      lnx-x86-agent-24a:  140
2024-03-30T17:32:00.856936815Z [2024-03-30 17:32:00][INFO   ][jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores] Scoring for hudson.model.FreeStyleProject@602cc842[environment-monitoring]:
2024-03-30T17:32:00.856936815Z   lnx-aarch64-agent-23a: 111240
2024-03-30T17:32:00.856936815Z                       : 7560
2024-03-30T17:32:00.856936815Z      lnx-x86-agent-24a:  740

Anything else?

I've gone through the code here and some of the Jenkins core code at https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/queue/MappingWorksheet.java. However, I'm not very familiar with it and can't see an obvious reason why the executors would not be available, nor can I tell whether it's a Jenkins core or a plugin problem.

Without the scoring plugin enabled, the lack of granular logging and the default load balancing mean I can't coerce it into replicating the issue, so if the cause is in Jenkins core, it may simply have no visible effect anywhere except in this plugin. Even so, one would hope it would be possible to work around the behaviour in this plugin so that it stays consistent.

Are you interested in contributing a fix?

Unfortunately, I've spent some time looking, but I don't understand the expected interaction with Jenkins well enough to propose a fix.

Gitaffe avatar Mar 30 '24 17:03 Gitaffe

I'm not able to either, sorry, but some more details are needed to fully reproduce it.

  1. exact number of agents (including the controller)
  2. exact number of build slots and configured labels for each agent
  3. configured load balancer options for each agent
  4. exact number of jobs and their labels for each agent

This may help to reproduce it in a test case. Please try to find the minimal settings.

viceice avatar Apr 19 '24 06:04 viceice

Did you see #15? I disabled a lot of tests to get it working with newer Jenkins BOMs.

So maybe we need to fix the tests or code to get the right behavior.

viceice avatar Apr 19 '24 06:04 viceice

I also think Jenkins tries to use the agent from the last run before it asks the load balancer. So as long as there is a free slot on the agent from the previous run, it will run there again?

viceice avatar Apr 19 '24 06:04 viceice

Hi there, sorry, I was out of the country. On the last comment: no, it's not using the agent from the last run. In this particular setup it will always use the preferred node until it coincides with another job on the timer, in which case it doesn't even see the preferred node as available.

Checking right now, for example, it manifests as every fifth run landing on the undesirable, lower-ranked node, because this job is on a 2-minute timer and another is on a 5-minute timer. Their schedules coincide every ten minutes, so every ten minutes this job floats onto that agent (meaning the agent is perpetually kept alive during the week unless we set an undesirably low idle time, since we use on-demand agents).

In answer to the questions, we have a rich setup so I'm not sure this would be that useful. The simplest case is simply (see the sketch after this list):

  • 2 jobs, one on a 2-minute timer, the other on a 5-minute timer
  • both having the same label, which is on both the controller and the agent
  • both having job-level preference scores making the controller the highest
  • system-level scoring rules with a scale such that the controller should always be chosen
  • more than one executor on both the controller and the agent
  • in an on-demand setup, the agent needs to be already running (it won't provoke the agent to start up; it just won't consider the controller for the second job in the same instant if the agent is already running)
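
A rough sketch of what such a reproduction test could look like, assuming JenkinsRule and made-up job, label, and node names; the scoring-load-balancer configuration itself is left out, and this is only meant to show the shape of the setup, not a finished test:

```java
import hudson.model.FreeStyleProject;
import hudson.model.labels.LabelAtom;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class SimultaneousSchedulingTest {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void bothJobsShouldLandOnThePreferredNode() throws Exception {
        j.jenkins.setNumExecutors(2);                 // more than one executor on the controller
        j.jenkins.setLabelString("monitoring");       // the shared label on the controller...
        j.createSlave("agent", "monitoring", null);   // ...and on one agent

        // (scoring-load-balancer system/job configuration favouring the controller would go here)

        FreeStyleProject aaa = j.createFreeStyleProject("aaa");
        FreeStyleProject bbb = j.createFreeStyleProject("bbb");
        aaa.setAssignedLabel(new LabelAtom("monitoring"));
        bbb.setAssignedLabel(new LabelAtom("monitoring"));

        // Schedule both in the same instant, mimicking two timers firing together.
        aaa.scheduleBuild2(0);
        bbb.scheduleBuild2(0);
        j.waitUntilNoActivity();

        // Expectation: both builds run on the preferred (controller) node.
        // With the bug, the second job never even sees the node chosen for the first.
    }
}
```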

However to answer in case something is relevant:

  1. In this case, 8. But what is relevant is how many are running. If there's no agent running, only the controller (i.e. a single node, with more than one executor), it works fine - the jobs start up on the controller (in parallel too). If there's more than one agent - where the scores are such that the controller should always be favoured, as in the description above - it will always exclude the first-chosen node from the list evaluated for the second job.

  2. Six executors on each is what we run, but anything more than 1 produces the behaviour in our testing (obviously with 1 executor a node CAN'T run jobs in parallel, so that doesn't seem to be relevant). As for labels, we have a lot; in this case it's monitoring, and there are four nodes labelled like that, but the problem we're trying to solve occurs when there are two active nodes (a controller and an agent) with that label running.

The controller also has the labels controller, monitoring-preferred and long-running, which are also used in job-level preference scores, but the non-controller agents don't have these.

  3. The jobs themselves have a Preference of Node hierarchy like:
  • built-in: 200
  • monitoring-preferred: 80
  • long-running: 50
  • monitoring: 30

The System configuration has these rules:

Scoring by Node Preferences
  • Scale for Node Preferences: 10
  • Scale for Project Preferences: 20

Scoring by Node Loads
  • Scale for Scores: 10
  • Score for Idle Executor: 4
  • Score for Busy Executor: -4
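
For what it's worth, my rough reading of how these settings combine (an assumption on my part, not verified against the plugin source) is:

```text
score(node) ~= nodePreference * 10 + jobPreference * 20 + 10 * (4 * idleExecutors - 4 * busyExecutors)
```

Under that reading, jobPreference(built-in) = 200 alone contributes 200 * 20 = 4000, and each executor flipping from idle to busy moves a node's score by (-4 - 4) * 10 = -80, which matches the controller dropping from 4040 to 3960 between the second and third jobs in the first log excerpt.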

However, in my experience changing these values has no effect because - as shown in the original report - even if we artificially, massively increase the score for built-in, that node is simply not considered as a place where the second job can run.

  4. This can happen with only two scheduled jobs, both just requiring the monitoring label. During the week there are many other jobs causing this contention - but it also happens at the weekend when only these two jobs run regularly for many hours, and as soon as anything else creates demand that starts up a non-controller agent, the jobs will start using it every ten minutes when their schedules clash.

I'm sorry, I can't immediately identify anything relevant in that disabled-tests ticket.

Gitaffe avatar Apr 22 '24 11:04 Gitaffe

Some testing:

2 executors each on master and "jenkinsagent". Both jobs (TestJob1 & TestJob2) start at the same time (via cron/timer). I added a log statement at https://github.com/jenkinsci/scoring-load-balancer-plugin/blob/1ee4e030f17d4eb333e5ce0a1b92c1ed9ea4f2c8/src/main/java/jp/ikedam/jenkins/plugins/scoringloadbalancer/ScoringLoadBalancer.java#L147, at the entry point(?) of this plugin.
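
Roughly the kind of log statement that produces the "map - Task: ..." lines below (the real change is whatever sits at the linked line; the exact wording here is my reconstruction):

```java
// java.util.logging logger (the output above already uses this class name as the logger):
private static final Logger LOGGER = Logger.getLogger(ScoringLoadBalancer.class.getName());

// At the start of map(), roughly:
@Override
public Mapping map(Queue.Task task, MappingWorksheet worksheet) {
    LOGGER.log(Level.INFO, "map - Task: {0}, worksheet.executors: {1}",
            new Object[] { task, worksheet.executors });
    // ... the existing scoring logic continues unchanged ...
}
```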

Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer map
map - Task: hudson.model.FreeStyleProject@77540832[TestJob1], worksheet.executors: [[JobOffer[ #0], JobOffer[ #1]], [JobOffer[jenkinsagent #0], JobOffer[jenkinsagent #1]]]

Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores
Scoring for hudson.model.FreeStyleProject@77540832[TestJob1]:
     jenkinsagent: 1000
                 : -500

Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer map
map - Task: hudson.model.FreeStyleProject@71823f88[TestJob2], worksheet.executors: [[JobOffer[ #0], JobOffer[ #1]]]

Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores
Scoring for hudson.model.FreeStyleProject@71823f88[TestJob2]:
                 : -500

It seems that the node is already missing in the data transmitted by Jenkins.

voruti avatar Jun 21 '24 12:06 voruti

Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[ #1] is a potential candidate for task TestJob2

Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[jenkinsagent #0] rejected TestJob2: Executor slot already in use

Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[ #0] is a potential candidate for task TestJob2

Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[jenkinsagent #1] rejected TestJob2: Executor slot already in use

https://github.com/jenkinsci/jenkins/blob/252b7dce8a96257c28e8946b338efc78bb489f6b/core/src/main/java/hudson/model/Queue.java#L291-L292 ~(I'm not sure if this really is the source of this error message, but it very much looks like it.)~ Edit: confirmed.

voruti avatar Jun 21 '24 13:06 voruti

I found the issue: It's the cache in https://github.com/jenkinsci/jenkins/blob/252b7dce8a96257c28e8946b338efc78bb489f6b/core/src/main/java/hudson/model/Queue.java#L1646-L1653 that was introduced for JENKINS-58101.

parked.values() contains all executors from all nodes, so in my example above 4 elements. After TestJob1 is successfully bound to, e.g., jenkinsagent Executor #0, there are 3 executors left. When evaluating the executors that are available for TestJob2 (parked.values() still contains the same 4 elements), jenkinsagent Executor #0 is rejected (because it was just occupied by TestJob1) and this rejection is cached for the whole node. When jenkinsagent Executor #1 is evaluated, the rejection is loaded from the cache, despite the executor being available.
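
A small self-contained toy model of that caching behaviour, to make the mechanism concrete (this is not the real Queue.java code; node and field names are made up):

```java
import java.util.*;

public class NodeRejectionCacheDemo {
    record Offer(String node, int executor, boolean busy) {}

    public static void main(String[] args) {
        // After TestJob1 has taken jenkinsagent #0, parked.values() for TestJob2 still lists all 4 offers:
        List<Offer> parked = List.of(
                new Offer("", 0, false), new Offer("", 1, false),   // built-in node, both idle
                new Offer("jenkinsagent", 0, true),                 // just occupied by TestJob1
                new Offer("jenkinsagent", 1, false));               // actually idle

        Map<String, Boolean> rejectionCache = new HashMap<>();      // one cached verdict per node
        List<Offer> candidates = new ArrayList<>();
        for (Offer o : parked) {
            // The first offer seen for a node decides the cached verdict for the whole node:
            boolean rejected = rejectionCache.computeIfAbsent(o.node(), n -> o.busy());
            if (!rejected) {
                candidates.add(o);
            }
        }
        // Prints only the two built-in offers: jenkinsagent #1 is dropped despite being idle.
        System.out.println(candidates);
    }
}
```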

voruti avatar Jun 21 '24 16:06 voruti

This issue only occurs for simultaneously started builds, because for builds that are invoked at a different point in time parked.values() only contains executors that are available: (screenshot omitted)

voruti avatar Jun 21 '24 17:06 voruti

I did a simple workaround: https://github.com/jenkinsci/scoring-load-balancer-plugin/compare/main...voruti:scoring-load-balancer-plugin:feature/upstream-issue-30-workaround-missing-executors

voruti avatar Jun 21 '24 19:06 voruti

Thanks! That sounds exactly right and matches my observations.

I can't comment on the correctness of the workaround as I'm not familiar with the code (I presume that, if this works, the null return value essentially throttles processing of the queue and the builds are just delayed?), but I hope this is considered for merging!

Gitaffe avatar Jun 21 '24 22:06 Gitaffe

Since this really is a hack that comes with its own consequences, like builds being delayed for some seconds, and also relies on things like system time and magic numbers, I'm unsure whether it should really be merged as is.
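
For readers of the thread, the rough shape implied by that description (my guess, not the actual code in the linked branch; the helper names and constants are hypothetical):

```java
@Override
public Mapping map(Queue.Task task, MappingWorksheet worksheet) {
    // If executors appear to be missing and we haven't waited too long yet,
    // decline to map: returning null keeps the item queued, so it is retried
    // on the next queue maintenance cycle (hence builds being delayed a few seconds).
    if (executorsLookIncomplete(worksheet)                                        // hypothetical heuristic
            && System.currentTimeMillis() - firstSeenMillis < MAX_WAIT_MILLIS) {  // the "magic number" window
        return null;
    }
    return scoreAndMap(task, worksheet);  // normal scoring path (hypothetical name)
}
```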

voruti avatar Jun 22 '24 07:06 voruti

Can you open a draft PR, so we can work on it together?

viceice avatar Jun 22 '24 09:06 viceice