Automated builds keep getting killed
We've long had issues with builds getting killed, presumably due to running out of memory in their container, but it only happened occasionally and was easily fixed by triggering a re-run. I have the impression that the problem is getting worse now.
- The check-build action in #2798 is now running for the ~~5th~~ 6th time already, as all previous runs have been killed after a long run of Hugo building the site.
- Quite a few publish actions are being killed too.
Presumably, the build will take more and more memory the bigger the Anthology gets, so we might have to do something about this eventually.
We could try to increase the swap size (https://github.com/pierotofy/set-swap-space)
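If I understand the linked action correctly, this would just be an extra step at the start of the build job; something like the following sketch (the input name and size here are my guesses from skimming the action's README, not tested):

```yaml
# Hypothetical step in our build workflow; swap-size-gb value is arbitrary.
- name: Increase swap space
  uses: pierotofy/set-swap-space@master
  with:
    swap-size-gb: 10
```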
Alternatively, GitHub has larger runners if we pay for them. Maybe @mjpost can tell us whether and, if so, how much budget we have.
According to the pricing docs, this would be:
- $4/month for a one-seat team
- $0.016/min of compute (4-core, 16 GB RAM) or $0.032/min (8-core, 32 GB RAM)
As the code is making use of parallelization, I assume that each PR check will then probably cost us around 20 cents.
6th attempt in a row at running check-build for #2798 has failed, I wonder if this PR has reached "critical mass" in terms of memory usage...
I can try the swap size thing, I have no idea if that might help or not, but thanks for the pointer!
I also wonder if we can't make use of caching to stop re-building the entire thing for every tiny change, but that's probably an orthogonal question, as we need to be able to make full builds from scratch anyway.
We can pay for this. ACL IT is looking into setting it up.
I do agree the builds feel unwieldy. It would be nice to avoid complete rebuilds for when a name changes. At the same time, we put this in place because occasionally a seemingly-innocuous change did break the preview way down the line.
Awesome, thanks Matt!
Of course we can be rather conservative with what parts we rebuild or reuse. But I believe Hugo already supports (and may even be optimized for) incremental builds, so caching the build output may be beneficial even if larger parts of the site change. In any case, I wonder if we can synchronize the check-build and preview actions better — if the parts that are specific to the preview (like the banner) could be factored out appropriately, one would think that the preview could be set to just run after & re-use the output from the check-build step, for example.
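To sketch what I mean (job names, the build command, and the output path below are all made up for illustration, not our actual setup):

```yaml
# Hypothetical: preview re-uses the output of check-build via an artifact.
jobs:
  check-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make site                    # full Hugo build
      - uses: actions/upload-artifact@v3
        with:
          name: site
          path: build
  preview:
    needs: check-build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v3
        with:
          name: site
          path: build
      # add the preview banner and deploy from here
```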
Another thing that might help memory usage is if we removed the bibliography files from the Hugo build chain and treated their links as external, like we do with the PDFs. That way, Hugo wouldn't know about them and track them, which – considering how many individual files those are – might also help things. That's maybe one of the simplest changes to try.
We are currently running two builds, mostly for historical reasons.
We should at least disable the check-build action, as it provides no additional info as far as I can see. Steps:
- @mjpost goes to the settings and changes the status check that needs to be passed to our preview action
- someone removes the check-build action
I am very much in favor of keeping the complete rebuild in place because otherwise we will run into very interesting states, e.g. when sharing state between different pull requests. I do not want to be the one debugging that in case something goes wrong. Additionally, the process will be much faster if we pay for bigger VMs.
> Another thing that might help memory usage is if we removed the bibliography files from the Hugo build chain and treated their links as external
That sounds like a good idea, and it would also enable us to remove the (bibliography -> hugo) dependency from the Makefile, so that bibtex and hugo could run in parallel.
> We should at least disable the check-build action, as it provides no additional info as far as I can see.
The preview action is not run on PRs that come from external repos, though.
Other than that, the only difference I can think of right now is that the preview doesn't build the entire BibTeX, only a small subset of it.
> I am very much in favor of keeping the complete rebuild in place because otherwise we will run into very interesting states, e.g. when sharing state between different pull requests.
That's not possible AFAIK. Actions can only access caches from "parents", e.g. a PR could utilize cached state from the master branch, but not from, say, other PRs.
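To illustrate what such a cache step might look like (the path and key below are assumptions for the sketch, not our actual configuration):

```yaml
# Hypothetical cache step; on a PR, restore-keys can only match entries
# written on the PR's base branch (e.g. master) or on the PR itself.
- uses: actions/cache@v3
  with:
    path: build                    # assumed Hugo output directory
    key: hugo-${{ github.sha }}
    restore-keys: |
      hugo-
```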
I think utilizing caches where possible would be the Right Thing™️ to do, but it might indeed be the case that we don't want to end up debugging this.
> That sounds like a good idea, and it would also enable us to remove the (bibliography -> hugo) dependency from the Makefile, so that bibtex and hugo could run in parallel.
Technically yes, though I'm not sure that would gain us anything, as Hugo already utilizes all available cores itself.
> The preview action is not run on PRs that come from external repos, though.

Same for the check-build action.
> Actions can only access caches from "parents", e.g. a PR could utilize cached state from the master branch, but not from, say, other PRs.
It would also require us to come up with a way to address the cache (i.e., to give it a unique ID). I am not sure how best to implement this without running into potential problems (does our pipeline, e.g., use mtime to check whether a file needs to be rebuilt?).
> Technically yes, though I'm not sure that would gain us anything, as Hugo already utilizes all available cores itself.
This is only faster if there is some I/O-bound processing.
> > The preview action is not run on PRs that come from external repos, though.
>
> Same for the check-build action.
The check-build action runs if the person has previously contributed, or it is enabled manually by one of us. It's required for merging after all. The preview action never runs.
But that is only because right now the check-build is marked as the protection. If we switched it to preview, we would run that instead (which makes sense because we want to see a preview before we merge)
I am happy to help set this up but according to this page, larger runners are only available for organizations and enterprises using the GitHub Team or GitHub Enterprise Cloud plans. I don't think the ACL account qualifies for this?
That sounds like the acl-org account would need to sign up for at least a "Team" subscription according to https://github.com/pricing, and then we would be able to select larger runners which are paid for by the minute, no? Disclaimer: I have never looked into this before.
This is my understanding as well
We could also try to first go with some other optimizations (swap, treating the bibtex as external files), but at some point we will need more RAM.
The price for a team is $4 per seat if I understand correctly, and we do not need seats, so we would have 1 seat unless 0 is a possibility?
Aha, looks like we are already on a team plan but they are just not charging us for it! I just enabled large runners and made it available only to this repo. Can you guys try a build?
Thanks @desilinguist! I assume it's the one named ubuntu-latest-m? If I read this correctly, we need to change the workflow files first to specify that runner; I can make a PR.
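If so, it should just be a one-line change per job in each workflow file, along these lines (the job name here is illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest-m   # instead of ubuntu-latest
```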
Actions in #2807 finished in record time :sunglasses:
Wow, fantastic.