Explore if SquashFS can be used to shrink size of packaged Docker containers

Open simonw opened this issue 6 years ago • 4 comments

Inspired by this article: https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html#sqlite-database-indexed--squashed

https://en.wikipedia.org/wiki/SquashFS is "a compressed read-only file system for Linux" - which means it could be a really nice fit for Datasette and its read-only SQLite databases.

It would be interesting to explore a Dockerfile recipe that used SquashFS to compress the SQLite database file that was bundled up by datasette package and friends.

simonw avatar Jun 24 '18 18:06 simonw

Relevant: https://code.fb.com/data-infrastructure/xars-a-more-efficient-open-source-system-for-self-contained-executables/

simonw avatar Jul 13 '18 18:07 simonw

See https://github.com/simonw/datasette/issues/657 and my changes that allow datasette to load parquet files

dazzag24 avatar Feb 11 '20 14:02 dazzag24

Tried this on fly.io. This particular database goes from 1.4GB to 200M. It is slower, though part of that might be from having no --inspect-file?

$ datasette publish fly ...   --generate-dir /tmp/deploy-this
...
$ mksquashfs large.db large.squashfs
$ rm large.db # don't accidentally put it in the image
$ cat Dockerfile
FROM python:3.8
COPY . /app
WORKDIR /app

ENV DATASETTE_SECRET='xyzzy'
RUN pip install -U datasette
# RUN datasette inspect large.db --inspect-file inspect-data.json
ENV PORT=8080
EXPOSE 8080
CMD mount -o loop -t squashfs large.squashfs /mnt; datasette serve --host 0.0.0.0 -i /mnt/large.db --cors --port $PORT
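
One way to tell how much of the slowdown comes from SquashFS rather than from the missing --inspect-file would be to time the same query against the mounted file and against an uncompressed copy. A minimal sketch using only Python's sqlite3 module; the paths and table name in the usage comment are hypothetical, and this is not how Datasette itself measures anything:

```python
import pathlib
import sqlite3
import time


def time_query(db_path, sql, repeats=5):
    """Return the best wall-clock time in seconds for running sql against db_path."""
    # immutable=1 tells SQLite the file cannot change, so it skips locking,
    # which suits a database on a read-only filesystem.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?immutable=1"
    conn = sqlite3.connect(uri, uri=True)
    try:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            best = min(best, time.perf_counter() - start)
        return best
    finally:
        conn.close()


# Hypothetical usage: compare the SquashFS mount with a plain copy.
# for path in ("/mnt/large.db", "/tmp/large.db"):
#     print(path, time_query(path, "select count(*) from my_table"))
```

Taking the best of several runs mostly measures the warm-cache case, so it may be worth timing a single first run after mounting as well.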

It would also be possible to copy the file onto the ~6GB available on the ephemeral container filesystem on startup. A little against the spirit of the thing? In this example the whole Docker image is 2.42 GB, and the SquashFS version of it is 1.14 GB.

dholth avatar Feb 17 '22 23:02 dholth
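
As a quick sanity check on the figures quoted above, the arithmetic can be spelled out. This sketch assumes decimal units and reads "200M" as 0.2 GB; both are assumptions, since the comment does not say how the sizes were measured:

```python
# Figures quoted in the comment above, in GB (decimal units assumed).
db_plain, db_squashed = 1.4, 0.2            # database file before / after mksquashfs
image_plain, image_squashed = 2.42, 1.14    # whole Docker image before / after

db_saved = db_plain - db_squashed            # GB saved on the database file
image_saved = image_plain - image_squashed   # GB saved on the whole image
db_ratio = db_plain / db_squashed            # shrink factor of the database
image_ratio = image_plain / image_squashed   # shrink factor of the whole image
rest_of_image = image_plain - db_plain       # part of the plain image that is not the database

print(f"database: {db_ratio:.1f}x smaller, saves {db_saved:.2f} GB")
print(f"image:    {image_ratio:.1f}x smaller, saves {image_saved:.2f} GB")
print(f"non-database part of the plain image: about {rest_of_image:.2f} GB")
```

The image shrinks by about 1.28 GB, close to the 1.2 GB saved on the database itself, so nearly all of the saving comes from the database file. The whole image only gets about 2.1x smaller because roughly 1 GB of it is the base image and dependencies, which SquashFS does not touch in this recipe.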

On second thought, any kind of quick-to-decompress-on-startup format could be helpful if we're paying for the container registry and deployment bandwidth but not for ephemeral storage.

dholth avatar Feb 17 '22 23:02 dholth
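
The decompress-on-startup idea from the last two comments could be sketched roughly as follows. Nobody in the thread names a compression format, so gzip from the Python standard library is used here purely as a stand-in, and the file paths are hypothetical. The datasette flags are the ones from the CMD line in the Dockerfile above:

```python
import gzip
import os
import shutil


def unpack_database(compressed_path, target_path, chunk_size=1024 * 1024):
    """Decompress a gzip-compressed file to target_path, streaming in chunks."""
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    tmp_path = target_path + ".partial"
    with gzip.open(compressed_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    # Rename only after a complete write, so a crash mid-copy never
    # leaves a truncated database at target_path.
    os.replace(tmp_path, target_path)
    return target_path


def serve(db_path, port):
    """Replace this process with datasette, serving db_path in immutable mode."""
    os.execvp(
        "datasette",
        ["datasette", "serve", "--host", "0.0.0.0",
         "-i", db_path, "--cors", "--port", str(port)],
    )


# Hypothetical startup sequence (paths are made up for illustration):
# db = unpack_database("/app/large.db.gz", "/tmp/data/large.db")
# serve(db, os.environ.get("PORT", "8080"))
```

Unlike the loop mount, this needs no mount privileges inside the container, at the cost of startup time and of using the ephemeral disk for the full uncompressed database.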