Provide `rsync`'ed assets via a container image
Description
I recently started rsyncing the extra assets via docker/scripts/app-rsync-extras.sh, which places a lot of files into data/developers. That caused parcel to run extremely slowly inside the container, due to https://github.com/parcel-bundler/parcel/issues/8128.
Would it be possible to provide those extra assets via a CI-built container image that we hook into the docker-compose setup? That way, they wouldn't need to be under /workspace and hence would not slow parcel down.
An alternative would be to somehow re-enable running yarn and parcel on the host rather than in the container, which is what my original Dockerfile did, but I think @NGPixel found some downsides with that?
Code of Conduct
- [X] I agree to follow the IETF's Code of Conduct
Most developers wouldn't have parcel/yarn on their host. We already have an issue for building better resources for "all the things" or "enough of the things". A mount that can be attached to a pre-built container makes sense, but we should offer containers lighter-weight than "everything" for people on less capable development boxes or behind slow networks. We should also address having the full, or a lighter-weight, development database image.
I'm not sure I understand the extra container image for assets? The assets can be stored outside the /workspace directory. That location can be mounted from the host as well. (I don't think GitHub would appreciate us generating multi-gigabyte images.)
You can run yarn / parcel on the host (it was never disabled), as long as you have node + yarn installed on your machine. Then run `yarn rebuild`, followed by `yarn dev`.
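For concreteness, a minimal sketch of that host-side workflow, assuming a local checkout and a recent node + yarn already installed (the checkout path is illustrative):

```sh
cd datatracker   # your local checkout of the repository (path is illustrative)
yarn rebuild     # rebuild the project's native packages for the host platform
yarn dev         # start the dev build (parcel, per the thread) in watch mode
```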
They can be, but by default they are not. We should maybe change the default.
Thanks for the tip about yarn/parcel!
The database image is already multiple gigabytes.
The raw data for all these artifacts is completely dominated by the id-archive, which is 12GB of mostly sparse text. (Don't images take advantage of compression?)
If this really turned into an issue for github, we could look at storing images elsewhere.
The goal is to help bootstrap developers. If we do that by changing the mountpoints and having them run rsync on their own, I'm ok with that. But I still think having the ability for them to get 'reasonably enough' vs 'everything' is worth pursuing, both for this and the developer database.
GitHub doesn't have storage limits on public repos and our current image size is still very acceptable IMO. My issue was with having 20GB images generated nightly.
I'm going to add a volume at /assets and change the settings / scripts to point to that location.
Where is the threshold for the nightly generation image size concern? We're building the db image nightly...
@rjsparks Fair point. I just see the assets as being on a different scale when it comes to size.
@larseggert PR #4018 should address this issue.
> You can run yarn / parcel on the host (it was never disabled), as long as you have node + yarn installed on your machine. Then run `yarn rebuild`, followed by `yarn dev`.
I have all that installed, but yarn rebuild on the host fails:
➤ YN0000: ┌ Resolution step
➤ YN0000: └ Completed
➤ YN0000: ┌ Fetch step
➤ YN0000: └ Completed in 0s 248ms
➤ YN0000: ┌ Link step
➤ YN0000: │ ESM support for PnP uses the experimental loader API and is therefore experimental
➤ YN0007: │ msgpackr-extract@npm:2.0.2 must be built because it never has been before or the last one failed
➤ YN0007: │ lmdb@npm:2.2.4 must be built because it never has been before or the last one failed
➤ YN0007: │ @parcel/watcher@npm:2.0.5 must be built because it never has been before or the last one failed
➤ YN0007: │ cypress@npm:9.7.0 must be built because it never has been before or the last one failed
➤ YN0009: │ lmdb@npm:2.2.4 couldn't be built successfully (exit code 1, logs can be found here: /private/var/folders/md/d7qn8x511850pjd3m956gz6r0000gn/T/xfs-1200a0de/build.log)
➤ YN0000: └ Completed in 2s 100ms
➤ YN0000: Failed with errors in 2s 457ms
Can you post the log file it points to with the error? It could be a missing native build dependency.
Not before next week. IIRC it was something about gyp.
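In case it helps when you get back to it: lmdb is a native addon, so the usual suspect for a gyp-style failure is a missing native build toolchain on the host. A hedged sketch for macOS (the temp path in the log above looks like macOS); the real fix depends on what the build.log actually says:

```sh
# node-gyp needs a C/C++ compiler toolchain; on macOS that usually means
# installing the Xcode Command Line Tools.
xcode-select --install

# then retry the native builds
yarn rebuild
```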
I think we have a named volume now, but by default it is empty, yes? Reopening this to continue the conversation about whether we provide a nightly build of a populated image to use instead (or as an alternative).
Yes, the volume is empty (apart from all the directories created by the container init script).
We could provide a prepopulated image as well, but that would be for non-vscode users only, as I don't see an easy way to provide two distinct dev environments where the user can choose, without having them manually edit the .devcontainer.json config.
However, is that really more convenient than using rsync inside the container as needed? By having an image, you would need to re-download the full 12GB layer every time you want the latest contents, vs using rsync to only download what has changed?
If we clean up the rsync, then maybe that's where to focus (we'd need to do that before building a populated image anyhow). But I still worry about the volume being cleaned away by, say, pruning, when trying to rebuild images/containers as head moves along.
Why wouldn't we always provide the full assets image?
Manually filling an empty volume and then making sure it doesn't get pruned is way too much work.
There's a good argument for both - the middle ground is that I start with a reasonably populated volume (not empty) and then keep it up to date with rsync (but I have the "doesn't get pruned" issue). If we assume people are going to get the volume from scratch every week or more frequently, we're putting quite a burden on their bandwidth (and on our use of GitHub if we get to the point where we have more than a dozen active developers).
Maybe we could include only the assets from the past X month(s) in the image? And if you need everything, you can rsync the rest?
Is there an easy way to only fetch assets created after a certain date via rsync?
How large is the full asset image? I am still unconvinced it would be a problem for GitHub if we just provided that.
I see several issues with including everything in an image:

- From what I read, Docker has a default image size limit of 10GB. While it can be increased, you have to manually change the value in the docker settings.
- Building such an image would take forever and it might not even succeed at all. To be tested, but I wouldn't be surprised if the CI build would crash or run out of memory.
- The same is true for pulling and creating a container out of the image. There's a good chance it could fail for some users because of the size.
- Any dev that wants to use the assets would need to download a massive image. Bandwidth / download speed might be limited for some users.

Docker images are simply not meant to be a medium for large data storage.
I would instead suggest the following:

- Create a CI workflow which would start an empty container, fetch the assets using rsync to a volume, export the volume to a `.tgz` file and make it available for download somewhere.
- Make the assets volume external. This means the volume is no longer part of the docker-compose project and must be created / managed manually. This gives us much greater control over its lifecycle. It should also prevent the volume from being pruned by the `docker compose down` command (I think? to be tested...)
- Create a `setup` script to prompt the user with the choice to either use the minimal volume or the full assets volume:
  - Minimal volume: An empty assets volume is created with the base structure.
  - Full assets volume: The `.tgz` file generated earlier is fetched. An assets volume is then created from this archive.
- Depending on your editor:
  - non-vscode: Use the `run` script as usual. If the user doesn't run `setup` first or the assets volume doesn't exist, it will call the `setup` script so that it can be created.
  - vscode: You would run the `setup` script to create the desired volume, then open the devcontainer as usual. If you open the devcontainer without running `setup` first or the assets volume is not found, an empty assets volume would automatically be created via the `initializeCommand` hook.

I believe this would address all scenarios and make the assets more manageable.
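As a rough illustration of the export/restore steps above, here is a minimal sketch using plain docker commands; the volume name, archive name, and paths are illustrative, not what the actual setup script would use:

```sh
# CI side: dump a populated assets volume into a tarball.
docker run --rm \
  -v dt-assets:/assets:ro \
  -v "$(pwd)":/backup \
  busybox tar czf /backup/assets.tgz -C /assets .

# Developer side: create the external volume once, then restore the archive
# into it. External volumes are not removed by `docker compose down -v`.
docker volume create dt-assets
docker run --rm \
  -v dt-assets:/assets \
  -v "$(pwd)":/backup:ro \
  busybox tar xzf /backup/assets.tgz -C /assets
```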
Is the full image even 10GB?
Also, isn't it possible to keep the rsynced stuff around as a build artifact, so we only need to rsync the diffs for future builds of the image?
I work with many large images and have never seen problems. I suggest we try that, since it seems simplest to set up, and change to something more elaborate if needed?
@larseggert As expected, the docker build process runs out of space during rsync and crashes:
https://github.com/ietf-tools/datatracker/actions/runs/2441032900
Multiple images per asset type?
That wouldn't work as you can't extend from multiple images at once.
Using a volume is really the proper way to go here...
Sorry, I meant multiple volumes (one for each asset type).
Also, in order to shrink the image, could we omit rsyncing the pdf and html renderings of I-Ds and RFCs, and maybe omit the json and p7s files?
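If that's worth exploring, rsync's exclude filters could do it. A hedged sketch only; the source module and destination path below are placeholders, not the ones docker/scripts/app-rsync-extras.sh actually uses:

```sh
# Sync drafts but skip the heavyweight renderings and signature/metadata files.
rsync -auz \
  --exclude='*.pdf' \
  --exclude='*.html' \
  --exclude='*.json' \
  --exclude='*.p7s' \
  rsync.example.org::drafts/ /assets/ietf-ftp/internet-drafts/
```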
Lars - this probably needs real-time interaction - would you like to schedule a call before 114 or discuss this during the 114 sprint?
I've been looking at trying to do this by docker volume and am becoming increasingly convinced that it's the wrong approach - what we really should be doing is grooming a better (and faster) rsync point (or its technical equivalent). And in reality, that's probably a few separate points, depending on just how much you want to have locally.
We really do have contributors who have machines that cannot hold the full set of assets.
I'd also like to discuss a future where we access the assets from the datatracker code through an interface rather than raw file opens. Such an interface could be taught, in development environments, to fetch things it doesn't have as it needs them.
Remember to make this a discussion point at the 114 sprint. If it turns out that having remote participants makes that hard, we'll schedule a dedicated meeting to it afterwards.
This lingers.
I think we need to shift the conversation to what the problem being solved really is and then return to specific mechanisms.
I think Lars' pain-point is "It takes too long to get all the artifacts", and we can talk about how to deal with that.
But most contributors do not want (and definitely do not need) all the artifacts. We should also be talking about how to get the artifacts they really need, not just "all". The solutions for that will be different than forcing everyone to deal with another medium-to-largish sized download.
I'd be fine with "all artifacts in the last two years" or so.