control-tower
self update job resets team permissions
I have GitHub auth federation set up and a "main" team in Concourse that looks like this:
roles:
- name: owner
  local:
    users: ["admin"]
- name: pipeline-operator
  github:
    teams: ["myorg:myteam"]
Every time I run the "self update" job, it resets all the permissions and I have to log in as the root admin user and re-apply my team's permissions with the yaml file.
I don't know if this is related to GitHub auth federation.
I feel like this is a bug and the self update should not change permissions.
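For reference, re-applying the roles is a single fly call; assuming the config above is saved as team.yml and the fly target is named main (both names are placeholders here), it looks roughly like:
fly -t main set-team --team-name main --config team.yml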
Yeah we face a similar issue on our own Concourse where we have github auth on the main team. Main team auth is configured as part of the BOSH manifest when deploying Concourse and we don't expose the flags through Control Tower. This means every deploy will apply the manifest and wipe custom main team auth. Auth on other teams shouldn't be impacted. Given the current implementation of Control Tower this is expected behaviour.
I made concourse-mgmt in an attempt to create tooling for managing Concourse teams from Concourse. Our mitigation for this problem is to run a variation of that pipeline every 10 minutes that ensures team auth is set properly.
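The mitigation boils down to a time-triggered job whose task logs in with a local admin user and re-applies the roles file. A minimal sketch of the commands such a task runs (target name, URL and credentials are placeholders, not the actual concourse-mgmt pipeline):
fly -t ci login -c https://ci.example.com -u admin -p "$ADMIN_PASSWORD" -n main
fly -t ci set-team -n main -c team.yml --non-interactive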
We are now seeing this happen (i.e. all pipelines disappear but resetting the auth brings them back) much more frequently. This is happening at least once a week, even though our self update job has only been run twice in the past 3 months.
I don't really know how to diagnose or investigate this. Any suggestions would be gratefully received.
I haven't attempted to move to a non-"main" team, as I understand we will lose all our history. Perhaps it's worth taking that hit if the "main" team is not usable for a default install of control-tower.
The Control Tower instance we use at EngineerBetter has all the pipelines in the main team with github auth configured. I'm not aware of auth getting wiped outside of upgrades. We do run a pipeline that re-applies the team config every 10 minutes though so it might be hiding the issue.
In theory if the github auth config is getting stripped from the main team outside of control tower upgrades then it's either bosh recreating the web instance (the main team as defined in the manifest only has basic auth) or it's a bug in Concourse. I guess you could check if your web instances are getting restarted.
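If you have the BOSH environment variables that control-tower prints (the info command has an --env flag for this) sourced, something along these lines should show whether the web VM or its processes have been bouncing (the deployment is normally called concourse, but check bosh deployments):
bosh -d concourse instances --ps --vitals
bosh -d concourse events --instance web/0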
We do run a pipeline that re-applies the team config every 10 minutes though so it might be hiding the issue.
We did that, but it has made things worse; the web machine is now being killed and restarted fairly frequently, making the web UI unusable.
We are attempting to investigate to see if we can figure out why. Any suggestions gratefully received!
Looks like OOM on the "web" machine is causing this restart loop.
We will try deploying a larger server with control-tower deploy --web-size medium.
Any idea why this might happen or how to stop it happening again? We're not intending to do anything unusual with control-tower and were hoping to not need to peek inside the black box.
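The relevant kernel log lines from the web VM are below; they can be pulled with something like bosh -d concourse ssh web/0 -c 'sudo dmesg | grep -i -A2 oom' (the deployment and instance names here are assumptions for a default control-tower install):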
2022-03-16T15:45:22.085786+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.206443] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=runc-bpm-uaa.scope,mems_allowed=0,global_oom,task_memcg=/,task=influxd,pid=11559,uid=1000
2022-03-16T15:45:22.085787+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.206465] Out of memory: Killed process 11559 (influxd) total-vm:4456212kB, anon-rss:1019316kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:7120kB oom_score_adj:0
2022-03-16T15:45:22.207469+00:00 8593cba8-5f6e-4d7e-95b1-012eee77b396 kernel: [ 1294.361999] oom_reaper: reaped process 11559 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We colocate influxdb and grafana on the web vm for the out-of-the-box metrics. I guess it's possible that Concourse is producing a high volume of metrics which is using up too much memory. I've also seen it before where having a frequent refresh rate on the grafana dashboard slows down the web instance. I would expect scaling the size of the web vm might resolve it.
Thanks.
Increasing the instance size does seem to have helped so far. We'll keep an eye on it. I'll update here if we have anything further.
It's not ideal that the web machine enters a restart loop when under memory pressure. Ideally it would just run slower.
(Also, it's definitely not ideal that bosh auto restarting the web machine wipes the team permissions.)
(We don't use influxdb or grafana. I think I asked on a different issue how to turn them off.)
I added a flag last night that lets you opt out of deploying the colocated metrics stack. If you download the new release then you can deploy with --no-metrics to get rid of those extra processes.
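For example, on the next deploy something like this should leave out the influxdb/grafana jobs (the deployment name is a placeholder):
control-tower deploy --no-metrics <your-deployment-name>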
Thanks. Unfortunately I tried using this release to deploy with the new flag and hit the following error:
Error getting CPI info:
Executing external CPI command: '/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/jobs/aws_cpi/bin/cpi':
Running command: '/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/jobs/aws_cpi/bin/cpi', stdout: '', stderr: 'bundler: failed to load command:/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi (/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi)
/home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/3.1.0/net/https.rb:23:in `require': cannot load such file -- openssl (LoadError)
Did you mean? open3
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/3.1.0/net/https.rb:23:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse/client/net_http/connection_pool.rb:5:in `require'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse/client/net_http/connection_pool.rb:5:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse.rb:36:in `require_relative'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/seahorse.rb:36:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/aws-sdk-core.rb:4:in `require'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/vendor/bundle/ruby/3.1.0/gems/aws-sdk-core-3.113.1/lib/aws-sdk-core.rb:4:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/lib/cloud/aws.rb:5:in `require'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/lib/cloud/aws.rb:5:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi:7:in `require'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/bosh_aws_cpi/bin/aws_cpi:7:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `load'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:58:in `kernel_load'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli/exec.rb:23:in `run'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:484:in `exec'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:31:in `dispatch'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/cli.rb:25:in `start'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/gems/3.1.0/gems/bundler-2.3.5/exe/bundle:48:in `block in <top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/site_ruby/3.1.0/bundler/friendly_errors.rb:103:in `with_friendly_errors'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/lib/ruby/gems/3.1.0/gems/bundler-2.3.5/exe/bundle:36:in `<top (required)>'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/bin/bundle:25:in `load'
from /home/ssm-user/.bosh/installations/9db8d17f-127d-4d76-4d04-58408c85d780/packages/ruby-3.1.0-r0.81.0/bin/bundle:25:in `<main>'
':
exit status 1
Exit code
I found the same issue with the 0.18.0 and 0.18.1 releases, and had to go back to 0.17.30 to complete a successful deployment.
Weird. That doesn't look like it should be related to anything in the new release(s). Sometimes the contents of the local ~/.bosh directory can get inexplicably broken. You could try deleting/renaming that directory and trying again.
Another possibility is that one of the bosh prerequisites has gotten broken somehow on your machine.
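Concretely, a cautious version of that is something like the following (the directory should be recreated by the next deploy, though that's worth verifying before removing anything you care about):
mv ~/.bosh ~/.bosh.bak
control-tower deploy <your-deployment-name>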
Deleting that directory and updating all the prereqs fixed it, thanks! I noticed that although Grafana etc. are no longer running, which is great, there are still security group rules added in the -atc group for ports 3000, 8844 and 8443, all of which I believe are related to metrics. It would be nice if these rules weren't created when using the --no-metrics flag, since they are unneeded. Thanks for adding the flag, it's good to know that we've not got them using up space/memory unnecessarily anymore :)
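For anyone else checking their environment, the leftover rules show up with something like this (the *-atc group-name pattern comes from the deployment above; adjust to your own naming):
aws ec2 describe-security-groups --filters "Name=group-name,Values=*-atc" --query "SecurityGroups[].IpPermissions[].[FromPort,ToPort]" --output text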
I forgot about the firewall ports. I'll look into patching that out.
I'm glad you managed to get it deployed 😄. Why the ~/.bosh directory sometimes breaks is still a mystery to me even after all these years of working with BOSH...
Increasing the instance size does seem to have helped so far. We'll keep an eye on it. I'll update here if we have anything further.
This does seem to have fixed things for us. Thanks for your help.
(The original issue at the top of this thread remains, AFAIK)
I cut 0.18.2 over the weekend to remove the metrics ports from the firewall when metrics are disabled. FYI, ports 8844 and 8443 are CredHub and UAA respectively, so they are still required.
The original issue is more of a feature request to configure github auth on the main team. I'll leave the issue open until that gets looked at.
I just cut 0.19.0 which adds flags for configuring github auth on the main team at deploy time. These settings should persist through web recreations.
A small note is that the Concourse release options I chose to use only support setting the owner role on the main team. There is a more free-form option in the release where you can provide your own config, which would support configuring other roles, but I wasn't sure how to cleanly let users pass multiline strings to flags in Control Tower, so I left it out for now.
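To check that the settings actually survive a web VM recreation, the main team's auth can be inspected with fly (the target name is a placeholder); it should keep listing the github owner entry after the instance comes back:
fly -t main teams --details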