missioncontrol
Use smaller docker images in development
Looking at docker-compose.yml, I see a few images whose size could be reduced significantly by switching to slimmer variants. This would save a significant amount of time during the initial image pull, which would help both locally when starting fresh and in CI (where presumably the images are not cached). Smaller images may also result in faster container startup.
For example (all sizes are compressed sizes):
- postgres:9.5 is 105MB whereas postgres:9.5-alpine is 14MB
- redis:3.2 is 37MB whereas redis:3.2-alpine is 8MB
Sadly uhopper/hadoop-namenode and uhopper/hadoop-datanode (both 337MB compressed!) don't have slimmer variants, plus I can see several mistakes in the upstream uhopper/hadoop Dockerfile that are bloating the image size.
Moving on, mozdata/docker-hive-metastore is a massive 500MB compressed, in part since it depends on the 337MB uhopper/hadoop image, but also because it similarly contains mistakes in its Dockerfile.
Finally, mozdata/docker-presto is a painful 872MB (compressed) - in part because of depending on a 254MB JDK base image, but also because of more missing cleanup in the Dockerfile.
For the last three, a few small upstream changes will probably make a significant difference to image size.
@whd -- do you have thoughts on this? From what I remember, the docker-compose setup is only used for testing. Can we just go ahead and make these changes for some performance wins?
@edmorley -- also, could you give some more details on what needs to be fixed in those upstream dockerfiles?
> could you give some more details on what needs to be fixed in those upstream dockerfiles?
In general, to produce the smallest Docker images, best practice is to:
- Carefully choose the base image for the best balance between image size and re-use with other images.
- Only install the bare minimum in the Dockerfile (for example don't install SSH servers or editors, and use --no-install-recommends with apt-get)
- Remove any build-time only dependencies added by the Dockerfile (eg uninstall gcc after using it), and make sure to also remove their transitive dependencies by using apt-get purge -y --auto-remove <dep>, not just apt-get remove
- Clean up caches/temp files (including /var/lib/apt/lists/)
- See if there are any unneeded files (eg tests/docs) that can be removed (example)
- When cleaning up packages/deleting files, always do so in the layer that added them (otherwise you don't get the space back, due to the way layers work)
- For even smaller images, try using an alpine image variant (though more care is required when doing this, since not everything is compatible with musl libc)
For some nice examples of this, see:
Plus see: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
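As a rough illustration of the practices listed above, here is a minimal sketch of a single RUN layer that installs a build-time tool, uses it, and cleans up after itself. The base image choice, the /opt/app path, the docs/ and tests/ directory names, and the example.com URL are all hypothetical placeholders, not taken from any of the images discussed here.

```dockerfile
# Hypothetical example: paths, package choices and the download URL are placeholders.
FROM debian:stretch-slim

# Everything happens in ONE layer, so files removed at the end never
# end up stored in the image.
RUN apt-get update \
    # --no-install-recommends avoids pulling in optional extras
    && apt-get install -y --no-install-recommends ca-certificates wget \
    && wget -q -O /tmp/app.tar.gz https://example.com/app.tar.gz \
    && mkdir -p /opt/app \
    && tar -xzf /tmp/app.tar.gz -C /opt/app \
    # Remove the archive plus files not needed at runtime (assumed names)
    && rm -rf /tmp/app.tar.gz /opt/app/docs /opt/app/tests \
    # Remove the build-time tool and anything installed only because of it
    && apt-get purge -y --auto-remove wget \
    # Remove the apt package index
    && rm -rf /var/lib/apt/lists/*
```

If the cleanup steps were moved to a separate RUN instruction, the image would not shrink, since the earlier layer would still contain the files.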
Looking at uhopper/hadoop-datanode and uhopper/hadoop-namenode, those Dockerfiles only create a directory and add a run script, so the size issue comes entirely from the base image: uhopper/hadoop. For that image there are several problems:
- Whilst one of the apt-get update calls is followed by an rm -rf /var/lib/apt/lists/*, a later one is not.
- It installs its own version of openjdk-8 into the full sized debian:jessie when perhaps the official openjdk:8-jre-slim might be better (which uses the smaller debian:stretch-slim base image)?
- The hadoop archive that is downloaded and extracted contains 85MB of files in a directory named docs/ that likely aren't ever going to be used in the Docker image (these aren't man-pages but HTML docs).
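To make the three points above concrete, here is a hedged sketch of what an adjusted base image might look like. It is not the upstream uhopper/hadoop Dockerfile: the HADOOP_URL value, the /opt/hadoop install path, the location of the docs directory inside the archive, and the assumption that the tarball has a single top-level directory are all placeholders that would need checking against the real image.

```dockerfile
# Sketch only, not the upstream Dockerfile.
FROM openjdk:8-jre-slim

# Placeholder: substitute the real Hadoop tarball URL.
ARG HADOOP_URL=https://example.com/hadoop.tar.gz

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl \
    && curl -fsSL -o /tmp/hadoop.tar.gz "$HADOOP_URL" \
    && mkdir -p /opt/hadoop \
    # Assumes the archive has one top-level directory
    && tar -xzf /tmp/hadoop.tar.gz -C /opt/hadoop --strip-components=1 \
    # Assumed location of the HTML docs; adjust to wherever they actually live
    && rm -rf /tmp/hadoop.tar.gz /opt/hadoop/docs \
    && apt-get purge -y --auto-remove curl \
    # Clean the apt lists in the same layer as the apt-get update
    && rm -rf /var/lib/apt/lists/*
```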
For mozdata/docker-hive-metastore, it also uses the above uhopper/hadoop base image (so will benefit from the above), however it should also really:
- remove wget after using it to download the archives (making sure to use the purge -y --auto-remove form)
- use --no-install-recommends with apt
- clean up after apt (rm -rf /var/lib/apt/lists/*)
- delete the dockerize archive after downloading/extracting it
- use automated builds triggered from both the GitHub repo and the base image changing, so it can share the same base image (currently it's a year old)
For mozdata/docker-presto, I would suggest:
- for the base image, trying openjdk (perhaps openjdk:8-jdk-slim) rather than airdock/oracle-jdk, which at first glance looks like it would save 150MB (plus it would share debian:stretch-slim with uhopper/hadoop, if that were adjusted too)
- not installing less
- removing wget after using it to download the archives (making sure to use the purge -y --auto-remove form)
- using --no-install-recommends with apt
- cleaning up after apt (rm -rf /var/lib/apt/lists/*)
- deleting the dockerize archive after downloading/extracting it
- using the --no-cache-dir option with pip, to prevent it from pointlessly caching the crudini download
- either uninstalling pip after using it, or else manually installing crudini rather than using pip at all
- trying to reduce the duplicated file bloat in the presto archive (this issue is about the RPM, but the same applies to the tarball: prestodb/presto#6380). Ideally upstream would fix this, but failing that, using rdfind's -makehardlinks true option would give huge savings (a quick test locally showed a 70% uncompressed size reduction of the archive contents - from 538MB to 153MB!).
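For the rdfind suggestion, a sketch of how the deduplication could be done in the same layer as the extraction is shown below. The PRESTO_URL value and /opt/presto path are placeholders, the tarball is assumed to have a single top-level directory, and it is an assumption that rdfind is installable from the base image's apt repositories. The crudini and dockerize steps are left out to keep the example short.

```dockerfile
# Sketch only, not the mozdata/docker-presto Dockerfile.
FROM openjdk:8-jdk-slim

# Placeholder: substitute the real Presto server tarball URL.
ARG PRESTO_URL=https://example.com/presto-server.tar.gz

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates rdfind wget \
    && wget -q -O /tmp/presto.tar.gz "$PRESTO_URL" \
    && mkdir -p /opt/presto \
    && tar -xzf /tmp/presto.tar.gz -C /opt/presto --strip-components=1 \
    && rm /tmp/presto.tar.gz \
    # Replace duplicate files with hard links; this must happen in the same
    # layer as the extraction, or the duplicates are already stored.
    && cd /tmp \
    && rdfind -makehardlinks true /opt/presto \
    # rdfind writes a report to the current directory by default
    && rm -f /tmp/results.txt \
    && apt-get purge -y --auto-remove rdfind wget \
    && rm -rf /var/lib/apt/lists/*
```

Whether Presto behaves correctly with hard-linked plugin jars would need to be verified before relying on this.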
I did have a quick look to see if there were any official Docker images for the above, or better third-party images, but didn't really find much, which was quite surprising (some of the other third-party images were horrendous - 2GB+, deleting files in a later layer than the one that created them, etc).
For further analysis, I'd try using the microbadger tool (it's a bit flaky but is still helpful), eg: https://microbadger.com/images/mozdata/docker-presto
> trying to reduce the duplicated file bloat in the presto archive
I've filed prestodb/presto#8904 to try and get upstream to fix this.
@wlach correct, the images mentioned above are only used for local testing.
There are still some suggestions in here we could use, so I'm going to reopen for now.
Totally agree with the necessity of using slim images in testing and production environments.