Explore if SquashFS can be used to shrink the size of packaged Docker containers
Inspired by this article: https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html#sqlite-database-indexed--squashed
https://en.wikipedia.org/wiki/SquashFS is "a compressed read-only file system for Linux" - which means it could be a really nice fit for Datasette and its read-only SQLite databases.
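A database sitting on a read-only mount has to be opened read-only, which matches how Datasette treats its immutable (`-i`) databases. A minimal sketch of the SQLite side of that, using the standard library (not Datasette's actual code):

```python
import os
import sqlite3
import tempfile

# Build a throwaway sample database to stand in for a file on a
# read-only SquashFS mount.
path = os.path.join(tempfile.mkdtemp(), "sample.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
conn.close()

# Open via a file: URI with mode=ro; immutable=1 additionally tells
# SQLite the file can never change, so it can skip locking entirely -
# exactly the situation on a compressed read-only filesystem.
ro = sqlite3.connect(f"file:{path}?mode=ro&immutable=1", uri=True)
count = ro.execute("SELECT count(*) FROM t").fetchone()[0]

# Any attempt to write should be rejected.
write_failed = False
try:
    ro.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError:
    write_failed = True
```
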
It would be interesting to explore a Dockerfile recipe that uses SquashFS to compress the SQLite database file bundled up by datasette package and friends.
Relevant: https://code.fb.com/data-infrastructure/xars-a-more-efficient-open-source-system-for-self-contained-executables/
See https://github.com/simonw/datasette/issues/657 and my changes that allow datasette to load parquet files
On Fly.io, this particular database goes from 1.4 GB down to 200 MB. It's slower, though part of that might be from having no --inspect-file?
```
$ datasette publish fly ... --generate-dir /tmp/deploy-this
...
$ mksquashfs large.db large.squashfs
$ rm large.db  # don't accidentally put it in the image
$ cat Dockerfile
FROM python:3.8
COPY . /app
WORKDIR /app
ENV DATASETTE_SECRET 'xyzzy'
RUN pip install -U datasette
# RUN datasette inspect large.db --inspect-file inspect-data.json
ENV PORT 8080
EXPOSE 8080
CMD mount -o loop -t squashfs large.squashfs /mnt; datasette serve --host 0.0.0.0 -i /mnt/large.db --cors --port $PORT
```
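One caveat with the CMD above: a loop mount needs a privileged container (CAP_SYS_ADMIN), which many platforms won't grant. A hedged alternative is a FUSE-based tool like squashfuse, which can mount without a full privileged container on some runtimes, though it still needs /dev/fuse exposed. The package name and paths here are assumptions, not tested on Fly:

```shell
# Sketch: mount the image with squashfuse instead of a privileged loop mount.
# Assumes a Debian-based image with apt, and that the runtime exposes /dev/fuse.
apt-get update && apt-get install -y squashfuse
mkdir -p /mnt/db
squashfuse large.squashfs /mnt/db
datasette serve --host 0.0.0.0 -i /mnt/db/large.db --cors --port $PORT
```
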
It would also be possible to copy the file onto the ~6GB available on the ephemeral container filesystem on startup. A little against the spirit of the thing? In this example the whole Docker image is 2.42 GB and the SquashFS version is 1.14 GB.
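The decompress-on-startup variant could be sketched as a startup command that unpacks the image onto the ephemeral disk before serving; the /data path is illustrative and unsquashfs comes from squashfs-tools, which the image would need installed:

```shell
# Sketch: unpack the squashfs image onto the ephemeral disk at boot,
# then serve the uncompressed database file from there.
# unsquashfs -d creates the destination directory (fails if it already exists).
unsquashfs -d /data large.squashfs
datasette serve --host 0.0.0.0 -i /data/large.db --cors --port $PORT
```

This trades startup time and ephemeral disk for full read speed once running.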
On second thought, any kind of quick-to-decompress-on-startup approach could be helpful if we're paying for container registry and deployment bandwidth but not for ephemeral storage.