Scribe-Data
Scribe-Data copied to clipboard
Add Docker image file for CLI
Contributor checklist
- [x] This pull request is on a separate branch and not the main branch
Description
Related issue
Fixes - #150
Thank you for the pull request!
The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
Maintainer checklist
-
[x] The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution
- The contributor's name and icon in remote commits should be the same as what appears in the PR
- If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for
git config user.emailin their local Scribe-Data repo
-
[x] The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
-
[ ] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)
So here the next step is pushing docker image to the registry. From which docker id shall I push?
I'll send this one over to @wkyoshida as he's the expert here :) :)
So here the next step is pushing docker image to the registry. From which docker id shall I push?
hmm I guess we have some options. We could create a repository in Docker Hub for us or we could use GitHub's registry.
The latter could be cool since I believe it'd prominently show up under Packages in the repo - though this option does have a limit of 500MB for free use. If you build the image from the Dockerfile, @mhmohona , about how large is the resulting image?
The former would likely allow us to save several versions of the Docker image though, since we can go beyond 500MB
What's the plan for this one here? @mhmohona, are we ready to merge? Just checking in as you were saying it's hard for you to test this, so do you need support finishing this up?
So I was able to successfully test it in Github codespace. I think I can do it till the end.
Nice!! Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?
RUN apt-get update && apt-get install -y \
build-essential \
pkg-config \
libicu-dev \
&& rm -rf /var/lib/apt/lists/*
For context @andrewtavis , building with the Dockerfile locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with the Brewfile that we already have, i.e. brew bundle install --file=Brewfile
@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though
COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
Nice!! Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?
RUN apt-get update && apt-get install -y \ build-essential \ pkg-config \ libicu-dev \ && rm -rf /var/lib/apt/lists/*For context @andrewtavis , building with the
Dockerfilelocally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with theBrewfilethat we already have, i.e.brew bundle install --file=Brewfile@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though
COPY Brewfile /app/ RUN brew bundle install --file=Brewfile && \ export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \ export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
So its the storage status -
I got error if i add Brewfile, keeping original file gave me following output -
and its the image in docker hub - https://hub.docker.com/repository/docker/mhmohona/scribe-data-cli I created it as test, will delete once everything is okay.
trying to pull, and then will run it locally.
@wkyoshida, what do you think about it?
and its the CLI in action -
I am not sure yet how to make only scribe-data list work instead of docker run mhmohona/scribe-data-cli list
Hmm - it seems that the image ends up pretty big..
What is the error with trying to use the Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.
With the docker cli, you could also try inspecting details of the image and looking into the layer details to find out which layers are the large ones, i.e. responsible for the enlarged image size.
With that information, we can go from there and modify the Dockerfile to address the responsible ones.
Hmm - it seems that the image ends up pretty big.. What is the error with trying to use the
Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.
got following error -
Its the output of docker history -
@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
scribe-data-cli latest 39ac5a96734c 2 days ago 8.59GB
mhmohona/scribe-data-cli latest 39ac5a96734c 2 days ago 8.59GB
@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker history 39ac5a96734c
IMAGE CREATED CREATED BY SIZE COMMENT
39ac5a96734c 2 days ago ENTRYPOINT ["python" "src/scribe_data/cli/ma… 0B buildkit.dockerfile.v0
<missing> 2 days ago ENV PYTHONPATH=/app/src 0B buildkit.dockerfile.v0
<missing> 2 days ago COPY /app /app # buildkit 440MB buildkit.dockerfile.v0
<missing> 2 days ago COPY /usr/local/bin /usr/local/bin # buildkit 28.8MB buildkit.dockerfile.v0
<missing> 2 days ago COPY /usr/local/lib/python3.9 /usr/local/lib… 7.99GB buildkit.dockerfile.v0
<missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0
<missing> 4 weeks ago CMD ["python3"] 0B buildkit.dockerfile.v0
<missing> 4 weeks ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 10MB buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PYTHON_GET_PIP_SHA256=6fb7b781206356f45a… 0B buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PYTHON_GET_PIP_URL=https://github.com/py… 0B buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PYTHON_SETUPTOOLS_VERSION=58.1.0 0B buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PYTHON_PIP_VERSION=23.0.1 0B buildkit.dockerfile.v0
<missing> 4 weeks ago RUN /bin/sh -c set -eux; for src in idle3 p… 0B buildkit.dockerfile.v0
<missing> 4 weeks ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 31.3MB buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PYTHON_VERSION=3.9.19 0B buildkit.dockerfile.v0
<missing> 4 weeks ago ENV GPG_KEY=E3FF2839C048B25C084DEBE9B26995E3… 0B buildkit.dockerfile.v0
<missing> 4 weeks ago RUN /bin/sh -c set -eux; apt-get update; a… 9.21MB buildkit.dockerfile.v0
<missing> 4 weeks ago ENV LANG=C.UTF-8 0B buildkit.dockerfile.v0
<missing> 4 weeks ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0
<missing> 4 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 4 weeks ago /bin/sh -c #(nop) ADD file:3d9897cfe027ecc7c… 74.8MB ```
<missing> 2 days ago COPY /usr/local/lib/python3.9 /usr/local/lib… 7.99GB buildkit.dockerfile.v0
hmm.. yeah this seems to be the big culprit (7.99GB). This would be the layer on L25 of the Dockerfile, and could be due to what is getting pip installed on L15. It would be interesting to see how large the layer for the pip install is. Notice though that we can't really see easily the layers from before L21, which is when the second stage of the Dockerfile starts. The deepest layer that we see in this output is of the layer on L22:
<missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0
One thing we could try for now is to actually remove the multi-stage aspect of the Dockerfile and just use a single stage. We'll be able to more easily analyze which layers are heavy in the builder stage.
got following error -
Yeah - what I had first sent you in Matrix is slightly off. Notice that it's using things like GITHUB_ENV AND GITHUB_PATH. Those are from when we need to install PyICU in the GitHub workflow (here).
The below might be closer to what we need, but I haven't tested it myself:
COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
NOTE: This will likely need a step before to install brew though
Just a quick note for clarification -
All the layers further than
<missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0
are most likely those of the python:3.9-slim image itself that the Dockerfile is using as the base image.
Alrighty -
So in tweaking the Dockerfile and building the image myself, I'm seeing that the big culprit here really appears to be the pip install step in fact (this layer alone ends up at over 8GB in size :scream:). More specifically - in particular the tensorflow and torch requirements as they bring in pretty large dependencies. As I am understanding though..
tensorflow: we're probably good to just scratch it off therequirements.txt. We are no longer making use of it, is this correct?torch: we are still using or at least did still use it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from therequirements.txtas well soon?- bonus: as we're already having to look at the
requirements.txt, are there any other dependencies listed that we no longer make use of that we could scratch off too?
CC: @andrewtavis
Thank you for looking into it @wkyoshida.
I did also decide to actually go with removing the multi-stage aspect of the Dockerfile @mhmohona :grin:
The second stage seemed that it was already pretty much copying over most of what was being done in the builder stage anyways, so it made sense to me to simplify the Dockerfile in this way
Points on the above:
tensorflow: we're probably good to just scratch it off therequirements.txt. We are no longer making use of it, is this correct?
- Ya we can remove it :)
torch: we are still using or at least did still use it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from therequirements.txtas well soon?
- Yes, we'll be removing it as well. Maybe what I can do is do the data update for Scribe-iOS post fixing formatting to include lexeme IDs, and then from there I can remove the old translation process completely except for a reference in the CLI that says that it's WIP, and then I'll remove all of the above from the dependencies? This would take another few weeks, but then from there we'd be ready to do the pip and Dockerized releases 😊
- bonus: as we're already having to look at the
requirements.txt, are there any other dependencies listed that we no longer make use of that we could scratch off too?
pyarrow,sentencepiece,tabulateandtransformersare currently not used or are being used in translation. We need to make a decision on how the "currently available data table" is being made, as I believe that that's whytabulatewas in there. Specifically we're planning on hosting the PNG on Srcribe-Server, but are we generating it with Scribe-Data? How are we making that table PNG, anyway 🤔
Ya we can remove it :)
:rocket: - removed with cb29d7b
.. and then from there I can remove the old translation process ..
Ya that sounds good!
For references, I tested out building the image after removing the soon-to-scrapped dependencies. The pip install layer is currently at ~8GB. After removing..
tensorflow: went from ~8GB to ~6GBtorch: went from ~6GB to ~1GBpyarrow,sentencepiece,tabulateandtransformers: went from ~1GB to ~800MB
So yea - I think once we're able to scrap torch (in particular), we'll be in better shape
How are we making that table PNG, anyway 🤔
I was thinking of leaving it entirely to Scribe-Server actually; so eventually we could even scrap tabulate too. I'd say there's no rush for it though. We can continue to use tabulate for this and keep it around until then. Its footprint is fairly small; so no concern with Docker if it's still around.
tensorflow has been removed in a recent commit. I'll get rid of the rest after I do the final data update.
Note that this can be updated and merged once #292 has been finished as then all of the above dependencies will be gone :)
Alright the old translation process has been removed, so we should be good to work on this again without excessively large images :)
How are we doing here now? Do we need to include some documentation on how to get the container up and running? :)
So now it took 48 minute.
and it's docker history output -
(.venv) (base) mona@mhm:~/Desktop/github/Scribe-Data$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
scribe-data-cli latest 5dd362f0a3ec 10 minutes ago 3.09GB
(.venv) (base) mona@mhm:~/Desktop/github/Scribe-Data$ docker history 5dd362f0a3ec
IMAGE CREATED CREATED BY SIZE COMMENT
5dd362f0a3ec 10 minutes ago ENTRYPOINT ["python" "src/scribe_data/cli/ma… 0B buildkit.dockerfile.v0
<missing> 10 minutes ago ENV PYTHONPATH=/app/src 0B buildkit.dockerfile.v0
<missing> 10 minutes ago COPY . /app # buildkit 267MB buildkit.dockerfile.v0
<missing> 10 minutes ago RUN /bin/sh -c pip install --no-cache-dir -r… 2.27GB buildkit.dockerfile.v0
<missing> 53 minutes ago COPY requirements.txt /app/ # buildkit 339B buildkit.dockerfile.v0
<missing> 53 minutes ago RUN /bin/sh -c apt-get update && apt-get ins… 437MB buildkit.dockerfile.v0
<missing> 58 minutes ago WORKDIR /app 0B buildkit.dockerfile.v0
<missing> 6 days ago CMD ["python3"] 0B buildkit.dockerfile.v0
<missing> 6 days ago RUN /bin/sh -c set -eux; for src in idle3 p… 36B buildkit.dockerfile.v0
<missing> 6 days ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 36.4MB buildkit.dockerfile.v0
<missing> 6 days ago ENV PYTHON_VERSION=3.13.0 0B buildkit.dockerfile.v0
<missing> 6 days ago ENV GPG_KEY=7169605F62C751356D054A26A821E680… 0B buildkit.dockerfile.v0
<missing> 6 days ago RUN /bin/sh -c set -eux; apt-get update; a… 9.24MB buildkit.dockerfile.v0
<missing> 6 days ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:a9a95cfab16803be0… 74.8MB
It's so much better than before. What do you think @wkyoshida?
Are we able to expand the readme and docs a bit with instructions to deploy the Docker setup, @mhmohona?