docker-compose-buildkite-plugin
Run docker-compose down before build?
I'm getting errors like the following but I only run one agent per host. I'm pretty sure a prior build got cancelled and didn't properly clean up the running containers. Can an option be added to kill any containers from previous jobs? Or, is there a more robust way to force cleanup to happen after the build completes?
```
ERROR: for buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 Cannot start service redis: driver failed programming external connectivity on endpoint buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 (d35a813f99837fb20d793bcfe8ba884c04c9e24eb8725f1791ecf914afdba253): Bind for 0.0.0.0:6379 failed: port is already allocated
```
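(For anyone debugging the same symptom, something like the following should show which stale container is still holding the port; `6379` here is just the redis port from the error above, and the container name in the comment is the one from that error.)

```bash
# Show which container still has host port 6379 published
docker ps --filter "publish=6379"

# If it's a leftover from a cancelled build, remove it by hand, e.g.:
# docker rm -f buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1
```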
Looking at the log from one of the cancelled builds, it looks like it's terminating without running most of the lifecycle hooks (maybe this is really a buildkite bug?). It seems odd that the agent is lost in these cases since the agent gets reused on the next build.
Hi @ianwremmel. Sorry you've been hitting that. Are you seeing this with the latest version of the agent? We had some signal handling bugs in previous versions, which might have prevented the plugin from having a chance to cleanup properly.
I'm seeing it on v3.0.3
Thanks @ianwremmel! I can’t find 3.0.3 in https://github.com/buildkite/agent/releases 🤔
oh, sorry, I misread, that's the version of the docker-compose plugin.
I'm using version v4.3.3 of the elastic ci stack, so it's whatever agent is in image ami-057de5cbfd86cfe88. If I'm reading the dashboard right, it's agent version v3.12.0.
Ah cool, thanks for finding out the exact agent version @ianwremmel!
Between the hook and this function, we should be cleaning everything up:
- https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/blob/40424b7bfd584e023bde924b8015bfc6ca07f516/hooks/pre-exit
- https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/blob/40424b7bfd584e023bde924b8015bfc6ca07f516/lib/run.bash#L3-L20
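For reference, the teardown that hook and function perform amounts to roughly the sequence below. This is a paraphrase, not the literal plugin code, and `$project` is just a stand-in for the per-job compose project name (e.g. the `buildkite<job-id>` prefix seen in the error output above):

```bash
# Rough paraphrase of the cleanup the plugin aims to run on pre-exit.
# $project stands in for the per-job compose project name.
docker-compose -p "$project" kill
docker-compose -p "$project" rm --force -v
docker-compose -p "$project" down --volumes
```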
The -1 agent lost log output usually happens when the instance gets forcefully terminated (or a spot instance doesn't terminate gracefully). If you head to the timeline tab on the job with the -1 exit status, and click through to the agent, did it run any jobs after that one, or was that the last job it was running before it disappeared?
For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?
I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?
Sorry for all the questions! Hopefully something will lead us to a clue, because we should already be doing a docker-compose down 🤔
I've seen the -1 agent lost when I've run out of memory (turns out eslint can't be run on a t2.micro), but in this case, I'm pushing a new commit and Buildkite is aggressively killing the job (let's call it Job A). I think the -1 agent lost is a bit misleading here; the agent is still available and takes on additional jobs (let's call the next one Job B).
Job B fails with the Bind for 0.0.0.0:6379 failed: port is already allocated error because redis is still running since Job A never had a chance to run docker-compose down.
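(One stopgap, sketched here as an assumption about an agent-level workaround rather than a plugin feature, is a `pre-command` agent hook that force-removes any containers left behind by a cancelled job, matching the `buildkite…` project names visible in the error above:)

```bash
#!/bin/bash
# Hypothetical agent-level pre-command hook: force-remove containers left
# over from a previous (cancelled) job so the next build can bind its ports.
set -euo pipefail

docker ps -aq --filter "name=buildkite" | xargs --no-run-if-empty docker rm -f
```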
> For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?
Yes. It's still taking jobs. I've had to go into AWS to manually kill it so a new one with a normal environment would boot.
> I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?
We're approximating a Heroku deployment. The relevant compose files (we use the array form of the Buildkite docker-compose config option) are posted below.
> Hopefully something will lead us to a clue, because we should already be doing a docker-compose down
Yeah, I looked into the plugin code and I see that it's supposed to be calling down. It seems like one of these
might be causing the job to be killed in a way that doesn't run the cleanup lifecycle hooks.
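(If the job really is being killed by a signal before the pre-exit hook runs, one generic way to make cleanup in a custom script or hook more resilient is a shell trap. This is only an illustrative sketch, not how the plugin itself is implemented, and `COMPOSE_PROJECT` is a hypothetical variable standing in for whatever project name the compose run uses:)

```bash
# Illustrative only: run cleanup when the job is cancelled and the process
# receives SIGTERM/SIGINT, not just on a normal exit.
cleanup() {
  docker-compose -p "$COMPOSE_PROJECT" down --volumes --remove-orphans || true
}
trap cleanup EXIT TERM INT
```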
docker-compose configs:
```yaml
version: "3.6"
services:
  app:
    build: .
    command: "rails server"
    depends_on:
      - chrome
      - firefox
      - postgres
      - redis
      - selenium
    environment:
      CAPYBARA_SERVER_HOST: "0.0.0.0"
      DATABASE_URL: postgresql://postgres@postgres:5432/postgres
      DISABLE_SSL: "I AM SURE"
      MEMCACHEDCLOUD_SERVERS: "memcached:11211"
      PORT: 3000
      RAILS_ENV: test
      RAILS_LOG_TO_STDOUT: "true"
      REDIS_URL: "redis://redis:6379"
      SELENIUM_DRIVER: ${SELENIUM_DRIVER-docker_chrome}
    networks:
      redacted:
        aliases:
          - provider.udlocal.com
          - admin.udlocal.com
          - api.udlocal.com
          - www.udlocal.com
    volumes:
      - source: ./db
        type: bind
        target: /app/db
```
```yaml
version: "3.6"
services:
  chrome: &selenium_node_config
    image: selenium/node-chrome
    depends_on:
      - selenium
    environment:
      - HUB_HOST=selenium
      - HUB_PORT=4444
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - redacted
  firefox:
    <<: *selenium_node_config
    image: selenium/node-firefox
  memcached:
    image: memcached
    networks:
      - redacted
    ports:
      - "11211:11211"
  postgres:
    image: ${DATABASE_IMAGE_URI:-postgres:9.6}
    networks:
      - redacted
    ports:
      - "5432:5432"
  redis:
    image: redis:4.0.11
    networks:
      - redacted
    ports:
      - "6379:6379"
  selenium:
    image: selenium/hub
    ports:
      - "4444:4444"
    networks:
      - redacted
networks:
  redacted: {}
volumes:
  postgres: {}
```
Has any headway been made on this issue? I'm also running into something similar.
Just commenting to say I've started running into this on my pipeline again, after a relatively long time not seeing it, and now I'm seeing it on a majority of builds.
Hi @glittershark! Which agent version are you running? We made several changes to exit-status handling and directory cleanup in the latest releases. Could you confirm whether this is still happening on the latest version (v3.33.3)?
Since this hasn't been reported or upvoted in almost a year, I'll proceed to close it, but feel free to re-open or create a new issue if this is still happening.
yeah, this seems to have cleared itself up for us 🤔