Automated builds keep getting killed
We've long had issues with builds getting killed, presumably due to running out of memory in their container, but it only happened occasionally and was easily fixed by triggering a re-run. I have the impression that the problem is getting worse now.
- The check-build action in #2798 is now running for the ~~5th~~ 6th time already, as all previous runs have been killed after a long run of Hugo building the site.
- Quite a few publish actions are being killed too.
Presumably, the build will take more and more memory the bigger the Anthology gets, so we might have to do something about this eventually.
We could try to increase the swap size (https://github.com/pierotofy/set-swap-space)
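If I understand the linked action correctly, this would just be an extra step at the start of the build job; something like the following sketch (the input name and size here are my guesses from skimming the action's README, not tested):

```yaml
# Hypothetical step in our build workflow; swap-size-gb value is arbitrary.
- name: Increase swap space
  uses: pierotofy/set-swap-space@master
  with:
    swap-size-gb: 10
```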
Alternatively, GitHub has larger runners if we pay for them. Maybe @mjpost can tell us whether and, if so, how much budget we have.
According to the pricing docs, this would be:
- $4/month for a one-seat team
- $0.016/min of compute (4-core, 16 GB RAM) or $0.032/min (8-core, 32 GB RAM)
As the code is making use of parallelization, I assume that each PR check will then probably cost us around 20 cents.
6th attempt in a row at running check-build for #2798 has failed, I wonder if this PR has reached "critical mass" in terms of memory usage...
I can try the swap size thing, I have no idea if that might help or not, but thanks for the pointer!
I also wonder if we can't make use of caching to stop re-building the entire thing for every tiny change, but that's probably an orthogonal question, as we need to be able to make full builds from scratch anyway.
We can pay for this. ACL IT is looking into setting it up.
I do agree the builds feel unwieldy. It would be nice to avoid complete rebuilds for when a name changes. At the same time, we put this in place because occasionally a seemingly-innocuous change did break the preview way down the line.
Awesome, thanks Matt!
Of course we can be rather conservative with what parts we rebuild or reuse. But I believe Hugo already supports (and may even be optimized for) incremental builds, so caching the build output may be beneficial even if larger parts of the site change. In any case, I wonder if we can synchronize the check-build and preview actions better — if the parts that are specific to the preview (like the banner) could be factored out appropriately, one would think that the preview could be set to just run after & re-use the output from the check-build step, for example.
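To sketch what I mean (job names, the build command, and the output path below are all made up for illustration, not our actual setup):

```yaml
# Hypothetical: preview re-uses the output of check-build via an artifact.
jobs:
  check-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make site                    # full Hugo build
      - uses: actions/upload-artifact@v3
        with:
          name: site
          path: build
  preview:
    needs: check-build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v3
        with:
          name: site
          path: build
      # add the preview banner and deploy from here
```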
Another thing that might help memory usage is if we removed the bibliography files from the Hugo build chain and treated their links as external, like we do with the PDFs. That way, Hugo wouldn't know about them and track them, which – considering how many individual files those are – might also help things. That's maybe one of the simplest changes to try.
We are currently running two builds, mostly for historical reasons.
We should at least disable the check-build action, as it provides no additional info as far as I can see. Steps:
- @mjpost goes to the settings and changes the status check that needs to be passed to our preview action
- someone removes the check-build action
I am very much in favor of keeping the complete rebuild in place because otherwise we will run into very interesting states, e.g. when sharing state between different pull requests. I do not want to be the one debugging that in case something goes wrong. Additionally, the process will be much faster if we pay for bigger VMs.
> Another thing that might help memory usage is if we removed the bibliography files from the Hugo build chain and treated their links as external
That sounds like a good idea, and it would also enable us to remove the (bibliography -> hugo) dependency from the Makefile, so that bibtex and hugo could run in parallel.
> We should at least disable the check-build action, as it provides no additional info as far as I can see.
The preview action is not run on PRs that come from external repos, though.
Other than that, the only difference I can think of right now is that the preview doesn't build the entire BibTeX, only a small subset of it.
> I am very much in favor of keeping the complete rebuild in place because otherwise we will run into very interesting states, e.g. when sharing state between different pull requests.
That's not possible AFAIK. Actions can only access caches from "parents", e.g. a PR could utilize cached state from the master branch, but not from, say, other PRs.
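To illustrate what such a cache step might look like (the path and key below are assumptions for the sketch, not our actual configuration):

```yaml
# Hypothetical cache step; on a PR, restore-keys can only match entries
# written on the PR's base branch (e.g. master) or on the PR itself.
- uses: actions/cache@v3
  with:
    path: build                    # assumed Hugo output directory
    key: hugo-${{ github.sha }}
    restore-keys: |
      hugo-
```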
I think utilizing caches where possible would be the Right Thing™️ to do, but it might indeed be the case that we don't want to end up debugging this.
> That sounds like a good idea, and it would also enable us to remove the (bibliography -> hugo) dependency from the Makefile, so that bibtex and hugo could run in parallel.
Technically yes, though I'm not sure that would gain us anything, as Hugo already utilizes all available cores itself.
> The preview action is not run on PRs that come from external repos, though.

Same for the check-build action.
> Actions can only access caches from "parents", e.g. a PR could utilize cached state from the master branch, but not from, say, other PRs.
It would also require us to come up with a way to address the cache (i.e., to give it a unique ID). I am not sure how best to implement this without running into potential problems (does our pipeline, e.g., use mtime to check whether a file needs to be rebuilt?).
> Technically yes, though I'm not sure that would gain us anything, as Hugo already utilizes all available cores itself.
This is only faster if there is some I/O-bound processing.
> > The preview action is not run on PRs that come from external repos, though.
>
> Same for the check-build action.
The check-build action runs if the person has previously contributed, or it is enabled manually by one of us. It's required for merging after all. The preview action never runs.
But that is only because right now the check-build is marked as the protection. If we switched it to preview, we would run that instead (which makes sense because we want to see a preview before we merge)
I am happy to help set this up but according to this page, larger runners are only available for organizations and enterprises using the GitHub Team or GitHub Enterprise Cloud plans. I don't think the ACL account qualifies for this?
That sounds like the acl-org account would need to sign up for at least a "Team" subscription according to https://github.com/pricing, and then we would be able to select larger runners which are paid for by the minute, no? Disclaimer: I have never looked into this before.
This is my understanding as well
We could also try to first go with some other optimizations (swap, treating the bibtex as external files), but at some point we will need more RAM.
The price for a team is $4 per seat if I understand correctly, and we do not need seats, so we would have 1 seat unless 0 is a possibility?
Aha, looks like we are already on a team plan but they are just not charging us for it! I just enabled large runners and made it available only to this repo. Can you guys try a build?
Thanks @desilinguist! I assume it's the one named ubuntu-latest-m? If I read this correctly, we need to change the workflow files first to specify that runner; I can make a PR.
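If so, it should just be a one-line change per job in each workflow file, along these lines (the job name here is illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest-m   # instead of ubuntu-latest
```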
Actions in #2807 finished in record time :sunglasses:
Wow, fantastic.