
`OpamDownload` assertion failure is causing opam-repo-ci builds to fail on arm32-ocaml-4.14

Open • shonfeder opened this issue 1 year ago • 7 comments

First noticed (afaik) at https://github.com/ocaml/opam-repository/pull/25905#issuecomment-2119010020

The error we're seeing in CI is:

/home/opam: (run (network host)
                 (shell "opam init --reinit --config .opamrc-sandbox -ni"))
Fatal error:
File "src/repository/opamDownload.ml", line 140, characters 2-8: Assertion failed
"/usr/bin/linux32" "/bin/sh" "-c" "opam init --reinit --config .opamrc-sandbox -ni" failed with exit status 99

which can be seen in, e.g., this CI log

The failing assertion is at

https://github.com/ocaml/opam/blob/391333d35bcdc8b55df709b876b8bafcf75f3452/src/repository/opamDownload.ml#L140

shonfeder, May 23 '24 20:05

Is it reproducible, or does it only happen from time to time?

kit-ty-kate, May 23 '24 22:05

FWIW it also happened on the cmdliner release here.

dbuenzli, May 23 '24 22:05

It's reproducible. E.g., every Jane Street package looks to be suffering the same fate currently: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b0fb4f8c144e4e78cd6de1972fc3453a2024d8a8

shonfeder, May 24 '24 12:05

It seems to happen only on arm32 ~~& freebsd~~ images. If it happens at the repository reloading stage, it shouldn't go through that code, since in the image the repository is defined as a directory (file:///home/opam/opam-repository). Is it possible to extract a backtrace and some logs (with -vv or --debug)?

rjbou, May 24 '24 15:05

I'll try to get this reproducing next week. I also realized I didn't account for container caching when I claimed it was reproducible: all of the CI jobs I've looked at so far pull that step from the cache.

shonfeder, May 25 '24 00:05

Trying to debug this without access to those machines has not produced any results so far. I've opened https://github.com/ocaml/opam/pull/5975 to at least show a more useful error message, which would help debug this further. My instinct is that a file is somehow being removed on those arm machines, but I'm still baffled as to why only arm (arm32 and arm64) machines are affected.

kit-ty-kate, May 27 '24 21:05

The failure was caused by the image getting broken somewhere, so that the $HOME directory was no longer readable, writable, or owned by the proper user.

The error message should still be fixed, though. I'm planning to open a more lightweight version of https://github.com/ocaml/opam/pull/5975 very soon to catch this earlier and display a better error message. I've removed this issue from the 2.2 board since it is no longer urgent.

kit-ty-kate, May 28 '24 14:05
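Given the diagnosis that $HOME was no longer readable, writable, or owned by the proper user, a broken image could be caught before `opam init` runs. The following is a minimal sketch of such a pre-flight check, written as a POSIX shell function. The function name `check_home` is hypothetical, and this is not opam's actual fix (which was planned as a change inside opam itself). It only illustrates the three conditions named above and turns each into an explicit error message instead of an `Assertion failed`.

```shell
# Hypothetical pre-flight check (not part of opam or opam-repo-ci).
# Verifies that a directory exists, is readable and writable, and is owned
# by the current user, printing a specific error to stderr otherwise.
check_home() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "error: $dir does not exist or is not a directory" >&2
    return 1
  fi
  if [ ! -r "$dir" ] || [ ! -w "$dir" ]; then
    echo "error: $dir is not readable and writable by $(id -un)" >&2
    return 1
  fi
  # `test -O` is not POSIX, so compare the numeric owner reported by `ls -ldn`
  # (third field) with the current user's uid instead.
  owner=$(ls -ldn "$dir" | awk '{print $3}')
  if [ "$owner" != "$(id -u)" ]; then
    echo "error: $dir is owned by uid $owner, not by $(id -un) (uid $(id -u))" >&2
    return 1
  fi
  return 0
}
```

For example, a CI step could run `check_home "$HOME" && opam init --reinit --config .opamrc-sandbox -ni`, so that a broken image stops with a message naming the directory and the problem. Note that when running as root the read/write test always succeeds regardless of permission bits, so that part of the check is only meaningful for a non-root user such as `opam`.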