`OpamDownload` assertion failure is causing opam-repo-ci builds to fail on arm32-ocaml-4.14
First noticed (afaik) at https://github.com/ocaml/opam-repository/pull/25905#issuecomment-2119010020
The error we're seeing in CI is:
/home/opam: (run (network host)
(shell "opam init --reinit --config .opamrc-sandbox -ni"))
Fatal error:
File "src/repository/opamDownload.ml", line 140, characters 2-8: Assertion failed
"/usr/bin/linux32" "/bin/sh" "-c" "opam init --reinit --config .opamrc-sandbox -ni" failed with exit status 99
which can be seen in, e.g., this CI log
The failing assertion is at
https://github.com/ocaml/opam/blob/391333d35bcdc8b55df709b876b8bafcf75f3452/src/repository/opamDownload.ml#L140
Is it reproducible, or does it only happen from time to time?
FWIW it also happened on the cmdliner release here.
It's reproducible. E.g., every Jane Street package looks to be suffering the same fate currently: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b0fb4f8c144e4e78cd6de1972fc3453a2024d8a8
It seems to happen only on arm32 ~~& freebsd~~ images. If it happens at the repository reloading stage, it shouldn't go through that code, since in the image the repository is defined as a directory (file:///home/opam/opam-repository). Is it possible to extract a backtrace and some logs (`-vv` | `--debug`)?
I'll see about reproducing this next week. I also realized I didn't take container caching into account when I claimed it was reproducible: all of the CI jobs I've looked at so far are pulling that step from the cache.
Trying to debug this without access to those machines has so far not produced any results. I've opened https://github.com/ocaml/opam/pull/5975 to at least show a more decent error message, which would help debug this further. My instinct tells me it is due to a file that is somehow removed on those arm machines, but I'm still baffled as to why only arm (arm32 and arm64) machines are affected.
The failure came from the fact that the image got broken somewhere along the way: the $HOME directory was no longer readable, writable, or owned by the proper user.
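For anyone who wants to rule out the same condition in another image, a quick check could look roughly like the Python sketch below. The helper name `home_dir_problems` is hypothetical and not part of opam or opam-repo-ci. It assumes a Unix system and only tests the three properties mentioned above (readable, writable, owned by the current user), which may not cover every way an image can break.

```python
import os
import tempfile


def home_dir_problems(path):
    """Return a list of problems that would make `path` unusable as $HOME.

    Hypothetical helper for illustration only; Unix-only (uses os.getuid).
    """
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]

    problems = []
    # Listing a directory needs read permission; entering it needs execute.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    # os.access checks against the real uid, so compare ownership to it too.
    owner = os.stat(path).st_uid
    if owner != os.getuid():
        problems.append(f"{path} is owned by uid {owner}, not uid {os.getuid()}")
    return problems


# Demonstration on a fresh temporary directory (created by, and owned by,
# the current user) and on a path that does not exist.
demo_dir = tempfile.mkdtemp()
demo_problems = home_dir_problems(demo_dir)
missing_problems = home_dir_problems(os.path.join(demo_dir, "does-not-exist"))
```

Inside the container, calling `home_dir_problems(os.path.expanduser("~"))` as the user that runs `opam init` should return an empty list on a healthy image.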
The error message should still be fixed, though. I'm planning to open a more lightweight version of https://github.com/ocaml/opam/pull/5975 very soon to catch this situation earlier and display a better error message. I've removed this issue from the 2.2 board as it is no longer urgent.