
Use smaller docker images in development

Open edmorley opened this issue 7 years ago • 7 comments

Looking at docker-compose.yml I see a few images whose size could be reduced significantly by switching to slimmer variants. This would save a significant amount of time during the initial image pull, which will help both locally when starting fresh and in CI (where presumably the images are not cached). Smaller images may also result in faster container startup.

For example (all sizes are compressed sizes):

  • postgres:9.5 is 105MB whereas postgres:9.5-alpine is 14MB
  • redis:3.2 is 37MB whereas redis:3.2-alpine is 8MB

Sadly uhopper/hadoop-namenode and uhopper/hadoop-datanode (both 337MB compressed!) don't have slimmer variants, and I can see several mistakes in the upstream uhopper/hadoop Dockerfile that are bloating the image size.

Moving on, mozdata/docker-hive-metastore is a massive 500MB compressed, in part since it depends on the 337MB uhopper/hadoop image, but also because it similarly contains mistakes in its Dockerfile.

Finally, mozdata/docker-presto is a painful 872MB (compressed) - in part because of depending on a 254MB JDK base image, but also because of more missing cleanup in the Dockerfile.

For the last three, a few small upstream changes will probably make a significant difference to image size.

edmorley avatar Aug 31 '17 19:08 edmorley

@whd -- do you have thoughts on this? From what I remember the docker-compose setup is only used for testing, can we just go ahead and make these changes for some performance wins?

wlach avatar Aug 31 '17 20:08 wlach

@edmorley -- also, could you give some more details on what needs to be fixed in those upstream dockerfiles?

wlach avatar Aug 31 '17 21:08 wlach

could you give some more details on what needs to be fixed in those upstream dockerfiles?

In general to produce the smallest Docker images, best practice is to:

  • Carefully choose the base image for the best balance between image size and re-use with other images.
  • Only install the bare minimum in the Dockerfile (for example don't install SSH servers or editors, and use --no-install-recommends with apt-get)
  • Remove any build-time only dependencies added by the Dockerfile (eg uninstall gcc after using it), and make sure to also remove their transitive dependencies by using apt-get purge -y --auto-remove <dep>, not just apt-get remove
  • Clean up caches/temp files (including /var/lib/apt/lists/)
  • See if there are any unneeded files (eg tests/docs) that can be removed (example)
  • When cleaning up packages/deleting files, always do so in the layer that added them (otherwise you don't get the space back, due to the way layers work)
  • For even smaller images, try using an alpine image variant (though more care is required when doing this, not everything is compatible with musl libc)

For some nice examples of this, see:

Plus see: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
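
To make the list above a bit more concrete, here is a rough sketch of the "install, use and clean up in the same layer" pattern. It isn't taken from any of the images discussed here: the base image, the src/ directory and the make target are placeholders I've picked purely for illustration.

```dockerfile
# Hypothetical example: base image, src/ and the make target are placeholders.
FROM debian:stable-slim

COPY src/ /usr/src/app/

# Install the build tools, use them, and remove them again in one RUN, so
# neither the tools nor the apt package lists end up in a committed layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc libc6-dev make \
    && make -C /usr/src/app install \
    && apt-get purge -y --auto-remove gcc libc6-dev make \
    && rm -rf /var/lib/apt/lists/*
```

If the purge and the rm were moved into a separate RUN instruction, the earlier layer would still contain the files and the final image would be no smaller.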

Looking at uhopper/hadoop-datanode and uhopper/hadoop-namenode, those Dockerfiles only create a directory and add a run script, so the size issue comes entirely from the base image: uhopper/hadoop. For that image there are several problems:

  • Whilst one of the apt-get update calls is followed by a rm -rf /var/lib/apt/lists/*, a later one is not.
  • It installs its own version of openjdk-8 into the full sized debian:jessie when perhaps the official openjdk:8-jre-slim might be better (which uses the smaller debian:stretch-slim base image)?
  • The hadoop archive that is downloaded and extracted contains 85MB of files in a directory named docs/ that likely aren't ever going to be used in the Docker image (these aren't man-pages but HTML docs).
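
As a sketch of how the last two points might look together (untested against the real uhopper/hadoop build): HADOOP_URL is a hypothetical build argument, and the exclude pattern is my assumption about where the HTML docs sit inside the tarball, so it should be checked against the actual archive.

```dockerfile
# Sketch only: HADOOP_URL is a hypothetical build argument, and the
# --exclude pattern is an assumed location for the HTML docs in the archive.
FROM openjdk:8-jre-slim

ARG HADOOP_URL

# Stream the archive straight into tar so it is never written to disk,
# skip the docs while extracting, then remove curl and the apt lists
# in the same layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl \
    && mkdir -p /opt/hadoop \
    && curl -fsSL "$HADOOP_URL" \
        | tar -xz -C /opt/hadoop --strip-components=1 --exclude='*/share/doc' \
    && apt-get purge -y --auto-remove curl \
    && rm -rf /var/lib/apt/lists/*
```

It would be built with something like docker build --build-arg HADOOP_URL=<archive url> .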

For mozdata/docker-hive-metastore, it also uses the above uhopper/hadoop base image (so will benefit from the above), however it should also really:

  • remove wget after using it to download the archives (making sure to use the purge -y --auto-remove form)
  • use --no-install-recommends with apt
  • clean up after apt (rm -rf /var/lib/apt/lists/*)
  • delete the dockerize archive after downloading/extracting it
  • use automated builds triggered from both the GitHub repo and the base image changing, so it can share the same base image (currently it's a year old)
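
For the first four points, something along these lines would probably do. This is only a sketch: BASE_IMAGE and DOCKERIZE_URL are hypothetical build arguments (the default base is just a stand-in, not the image the real Dockerfile uses), and I'm assuming the archive is a gzipped tarball containing the dockerize binary.

```dockerfile
# Sketch only: BASE_IMAGE and DOCKERIZE_URL are hypothetical build arguments.
ARG BASE_IMAGE=debian:stable-slim
FROM ${BASE_IMAGE}

ARG DOCKERIZE_URL

# Download, extract, delete the archive, and purge wget all in one layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates wget \
    && wget -q -O /tmp/dockerize.tar.gz "$DOCKERIZE_URL" \
    && tar -xzf /tmp/dockerize.tar.gz -C /usr/local/bin \
    && rm /tmp/dockerize.tar.gz \
    && apt-get purge -y --auto-remove wget \
    && rm -rf /var/lib/apt/lists/*
```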

For mozdata/docker-presto, I would suggest:

  • for the base image, trying the official openjdk image (perhaps openjdk:8-jdk-slim) rather than airdock/oracle-jdk, which at first glance looks like it would save 150MB (plus it would share debian:stretch-slim with uhopper/hadoop, if that were adjusted too)
  • not installing less
  • removing wget after using it to download the archives (making sure to use the purge -y --auto-remove form)
  • using --no-install-recommends with apt
  • cleaning up after apt (rm -rf /var/lib/apt/lists/*)
  • deleting the dockerize archive after downloading/extracting it
  • using the --no-cache-dir option with pip, to prevent it from pointlessly caching the crudini download
  • either uninstalling pip after using it, or else manually installing crudini rather than using pip at all
  • trying to reduce the duplicated file bloat in the presto archive (this issue is about the RPM, but the same applies to the tarball: prestodb/presto#6380). Ideally upstream would fix this, but failing that using rdfind's -makehardlinks true option would give huge savings (a quick test locally showed 70% uncompressed size reduction of the archive contents - from 538MB to 153MB!).
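
A sketch of the rdfind workaround from the last point, in case upstream doesn't fix the duplication. PRESTO_URL and the /opt/presto path are hypothetical, and I haven't checked that Presto is happy running from hardlinked files, so treat this as a starting point rather than a tested recipe.

```dockerfile
# Sketch only: PRESTO_URL and /opt/presto are hypothetical placeholders.
FROM openjdk:8-jdk-slim

ARG PRESTO_URL

# Extract the tarball, replace duplicate files with hardlinks, then remove
# the tools again. The dedup must happen in the same RUN as the extraction,
# otherwise the duplicated files are already baked into an earlier layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl rdfind \
    && mkdir -p /opt/presto \
    && curl -fsSL "$PRESTO_URL" | tar -xz -C /opt/presto --strip-components=1 \
    && rdfind -makehardlinks true -makeresultsfile false /opt/presto \
    && apt-get purge -y --auto-remove curl rdfind \
    && rm -rf /var/lib/apt/lists/*
```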

I did have a quick look to see if there were any official Docker images for the above, or better third-party images, but didn't really find much, which was quite surprising (some of the other third-party images were horrendous: 2GB+, deleting files in a later layer than the one that created them, etc).

For further analysis, I'd try using the microbadger tool (it's a bit flaky but is still helpful), eg: https://microbadger.com/images/mozdata/docker-presto

edmorley avatar Sep 04 '17 12:09 edmorley

trying to reduce the duplicated file bloat in the presto archive

I've filed prestodb/presto#8904 to try and get upstream to fix this.

edmorley avatar Sep 04 '17 12:09 edmorley

@wlach correct, the images mentioned above are only used for local testing.

maurodoglio avatar Sep 05 '17 10:09 maurodoglio

There are still some suggestions in here we could use, so I'm going to reopen for now.

wlach avatar Sep 27 '17 13:09 wlach

Totally agree with the necessity of using slim images in testing and production environments.

gecube avatar Feb 26 '18 10:02 gecube