docker-compose-buildkite-plugin
Run docker-compose down before build?
I'm getting errors like the following but I only run one agent per host. I'm pretty sure a prior build got cancelled and didn't properly clean up the running containers. Can an option be added to kill any containers from previous jobs? Or, is there a more robust way to force cleanup to happen after the build completes?
```
ERROR: for buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 Cannot start service redis: driver failed programming external connectivity on endpoint buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1 (d35a813f99837fb20d793bcfe8ba884c04c9e24eb8725f1791ecf914afdba253): Bind for 0.0.0.0:6379 failed: port is already allocated
```
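(For anyone debugging the same symptom, something like the following should show which stale container is still holding the port; `6379` here is just the redis port from the error above, and the container name in the comment is the one from that error.)

```bash
# Show which container still has host port 6379 published
docker ps --filter "publish=6379"

# If it's a leftover from a cancelled build, remove it by hand, e.g.:
# docker rm -f buildkite0163351ce43d43c792276a2da3d2b4e2_redis_1
```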
Looking at the log from one of the cancelled builds, it looks like it's terminating without running most of the lifecycle hooks (maybe this is really a buildkite bug?). It seems odd that the agent is lost in these cases since the agent gets reused on the next build.
Hi @ianwremmel. Sorry you've been hitting that. Are you seeing this with the latest version of the agent? We had some signal handling bugs in previous versions, which might have prevented the plugin from having a chance to cleanup properly.
I'm seeing it on v3.0.3
Thanks @ianwremmel! I can’t find 3.0.3 in https://github.com/buildkite/agent/releases 🤔
oh, sorry, I misread, that's the version of the docker-compose plugin.
I'm using version v4.3.3 of the elastic ci stack, so it's whatever agent is in image ami-057de5cbfd86cfe88. If I'm reading the dashboard right, it's agent version v3.12.0.
Ah cool, thanks for finding out the exact agent version @ianwremmel!
Between the hook and this function, we should be cleaning everything up:
- https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/blob/40424b7bfd584e023bde924b8015bfc6ca07f516/hooks/pre-exit
- https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/blob/40424b7bfd584e023bde924b8015bfc6ca07f516/lib/run.bash#L3-L20
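For reference, the teardown that hook and function perform amounts to roughly the sequence below. This is a paraphrase, not the literal plugin code, and `$project` is just a stand-in for the per-job compose project name (e.g. the `buildkite<job-id>` prefix seen in the error output above):

```bash
# Rough paraphrase of the cleanup the plugin aims to run on pre-exit.
# $project stands in for the per-job compose project name.
docker-compose -p "$project" kill
docker-compose -p "$project" rm --force -v
docker-compose -p "$project" down --volumes
```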
The -1 agent lost log output usually happens when the instance gets forcefully terminated (or a spot instance doesn't terminate gracefully). If you head to the timeline tab on the job with the -1 exit status, and click through to the agent, did it run any jobs after that one, or was that the last job it was running before it disappeared?
For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?
I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?
Sorry for all the questions! Hopefully something will lead us to a clue, because we should already be doing a docker-compose down 🤔
I've seen the -1 agent lost when I've run out of memory (turns out eslint can't be run on a t2.micro), but in this case, I'm pushing a new commit and Buildkite is aggressively killing the job (let's call it Job A). I think the -1 agent lost is a bit misleading here; the agent is still available and takes on additional jobs (let's call the next one Job B).
Job B fails with the Bind for 0.0.0.0:6379 failed: port is already allocated error because redis is still running since Job A never had a chance to run docker-compose down.
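(One stopgap, sketched here as an assumption about an agent-level workaround rather than a plugin feature, is a `pre-command` agent hook that force-removes any containers left behind by a cancelled job, matching the `buildkite…` project names visible in the error above:)

```bash
#!/bin/bash
# Hypothetical agent-level pre-command hook: force-remove containers left
# over from a previous (cancelled) job so the next build can bind its ports.
set -euo pipefail

docker ps -aq --filter "name=buildkite" | xargs --no-run-if-empty docker rm -f
```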
> For the job that had the original Bind for 0.0.0.0:6379 failed: port is already allocated error, I wonder if you could click through to the agent on that one too and see what previous jobs it ran and if they were cancelled like you were thinking?
Yes. It's still taking jobs. I've had to go into AWS to manually kill it so a new one with a normal environment would boot.
> I wonder why it's trying to bind the docker host's port for the redis instance, rather than just using the internal networking between containers? What's the expose port config on those?
We're approximating a Heroku deployment. The relevant compose files (we use the array form of the Buildkite docker-compose config option) are posted below.
> Hopefully something will lead us to a clue, because we should already be doing a docker-compose down
Yeah, I looked into the plugin code and I see that it's supposed to be calling down. It seems like one of these
might be causing the job to be killed in a way that doesn't run the cleanup lifecycle hooks.
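(If the job really is being killed by a signal before the pre-exit hook runs, one generic way to make cleanup in a custom script or hook more resilient is a shell trap. This is only an illustrative sketch, not how the plugin itself is implemented, and `COMPOSE_PROJECT` is a hypothetical variable standing in for whatever project name the compose run uses:)

```bash
# Illustrative only: run cleanup when the job is cancelled and the process
# receives SIGTERM/SIGINT, not just on a normal exit.
cleanup() {
  docker-compose -p "$COMPOSE_PROJECT" down --volumes --remove-orphans || true
}
trap cleanup EXIT TERM INT
```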
docker-compose configs:
```yaml
version: "3.6"
services:
  app:
    build: .
    command: "rails server"
    depends_on:
      - chrome
      - firefox
      - postgres
      - redis
      - selenium
    environment:
      CAPYBARA_SERVER_HOST: "0.0.0.0"
      DATABASE_URL: postgresql://postgres@postgres:5432/postgres
      DISABLE_SSL: "I AM SURE"
      MEMCACHEDCLOUD_SERVERS: "memcached:11211"
      PORT: 3000
      RAILS_ENV: test
      RAILS_LOG_TO_STDOUT: "true"
      REDIS_URL: "redis://redis:6379"
      SELENIUM_DRIVER: ${SELENIUM_DRIVER-docker_chrome}
    networks:
      redacted:
        aliases:
          - provider.udlocal.com
          - admin.udlocal.com
          - api.udlocal.com
          - www.udlocal.com
    volumes:
      - source: ./db
        type: bind
        target: /app/db
```
```yaml
version: "3.6"
services:
  chrome: &selenium_node_config
    image: selenium/node-chrome
    depends_on:
      - selenium
    environment:
      - HUB_HOST=selenium
      - HUB_PORT=4444
    volumes:
      - /dev/shm:/dev/shm
    networks:
      - redacted
  firefox:
    <<: *selenium_node_config
    image: selenium/node-firefox
  memcached:
    image: memcached
    networks:
      - redacted
    ports:
      - "11211:11211"
  postgres:
    image: ${DATABASE_IMAGE_URI:-postgres:9.6}
    networks:
      - redacted
    ports:
      - "5432:5432"
  redis:
    image: redis:4.0.11
    networks:
      - redacted
    ports:
      - "6379:6379"
  selenium:
    image: selenium/hub
    ports:
      - "4444:4444"
    networks:
      - redacted
networks:
  redacted: {}
volumes:
  postgres: {}
```
Has any headway been made on this issue? I'm also running into something similar.
Just commenting to say I've started running into this on my pipeline again, after a relatively long time not seeing it, and now I'm seeing it on a majority of builds.
Hi @glittershark! Which agent version are you running? We made several changes to exit-status handling and directory cleanup in the latest releases. Could you confirm whether this is still happening on the latest version (v3.33.3)?
Since this hasn't been reported or upvoted in almost a year, I'll proceed to close it, but feel free to re-open or create a new issue if this is still happening.
yeah, this seems to have cleared itself up for us 🤔