atlasrep website down results in GAP hanging
The CI issues are (were) caused by the atlasrep web site https://www.math.rwth-aachen.de/homes/Thomas.Breuer/atlasrep/ being down, which makes gap hang when trying to remotely fetch the atlas data. The website is back up now, but I find it quite shocking that a remote website being down can break running a local instance of GAP.
Ah, thanks. This explains why the error occurred without any changes on our side. Not sure where to report the underlying issue. @dimpase any idea?
Originally posted by @tobiasdiez in https://github.com/sagemath/sage/pull/39993#issuecomment-2823079716
Atlasrep should throw an error, not just hang.
Is it possible that this affects CI tests for packages?
Recently, some of my CI tests occasionally take ~15 mins instead of the usual ~1 min, even though the times given by TestDirectory did not increase. And this increase in time happens when GAP is started normally (which loads AtlasRep), but does not happen when it is started with the -A flag (which doesn't load AtlasRep).
See e.g. these affected and unaffected runs.
Thanks for the report.
It seems the website in Aachen was offline a couple of days ago which caused this? (Also the GAPDoc homepage was not available during that time.) Of course that can happen, but I agree this should not cause things to stall for so long.
Perhaps it is a matter of setting a timeout value somewhere?
CC @ThomasBreuer
Thomas and I discussed this; here are some things we plan to do to improve the situation:
On the technical side:
- the `Download` function in the Utils package should be enhanced with an (optional) `timeout` argument
- other `Download` implementations (in curlInterface and JuliaInterface) need to implement support for this
- this then needs to be released
- atlasrep (and other users of Download) should possibly set a timeout
In addition, it would be good to have an active mirror of the atlasrep data. We can host it on e.g. gap-system.org (in Kaiserslautern). Once we have that, atlasrep can be modified to try the mirror if it fails to reach the primary site.
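To illustrate the timeout-plus-mirror idea, here is a minimal sketch in Python (not GAP, and not the planned `Download` API). The function name `fetch_with_fallback`, its `opener` parameter, and the example hosts are all hypothetical. Note that the `timeout` passed to `urllib.request.urlopen` bounds blocking socket operations such as connecting, not the total transfer time.

```python
import urllib.request


def fetch_with_fallback(urls, timeout=5.0, opener=None):
    """Try each URL in order; return (url, data) from the first that answers.

    `opener` is a hypothetical injection point so the logic can be exercised
    without network access; it defaults to urllib.request.urlopen.
    """
    if opener is None:
        opener = urllib.request.urlopen
    errors = []
    for url in urls:
        try:
            # With a timeout, an unreachable server raises instead of hanging.
            with opener(url, timeout=timeout) as response:
                return url, response.read()
        except OSError as exc:
            # URLError and TimeoutError are both subclasses of OSError.
            errors.append((url, exc))
    raise RuntimeError(f"all sources failed: {errors}")
```

A caller would pass the primary site first and the mirror second, e.g. `fetch_with_fallback(["https://primary.example/file", "https://mirror.example/file"], timeout=10)`, and would get an error rather than a stall when both are unreachable.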
This seems to be happening again as of yesterday. My CI tests with AtlasRep loaded took ~100 minutes to finish, without atlasrep <1 min. And indeed it seems that the AtlasRep and GAPDoc websites are down.
Indeed, there was a network problem with our webserver www.math.rwth-aachen.de (which is fixed now). But I don't understand why this is a problem for the CI tests. This webserver is only needed when a new version of the AtlasRep, GAPDoc, EDIM, ... packages needs to be fetched. As far as I can see, running AtlasRep was not affected by this problem. The Atlas data are on a different server, atlas.math.rwth-aachen.de, which was running without problems.
I suspect it may have been caused by the MFER and CTBlocks files then (lines 225 - 261 in userpref.g), which are hosted on math.rwth-aachen.de. If I remove my local copies of those files and change the URLs to nonsense in userpref.g, my CI tests give an error in the same place they would hang.
Thanks for this hint, I was not aware of these data from the www.math.rwth-aachen.de server.
It would be really good to have atlasrep more resilient to such outages (which can also happen even if the RWTH servers are up and running, due to network errors "in between" those servers and the user's host).
We really should start making use of timeouts -- getting https://github.com/gap-packages/utils/pull/81 finished and merged would go a long way towards this
We could also add mirrors for these data files and try those mirrors (either as a fallback; or possibly in random order; or something else)
- I have in mind mirroring on GitHub, on our own gap-system.org server in Kaiserslautern, or possibly a CDN like cloudflare...
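
The two orderings mentioned above (primary first with mirrors as fallback, or all sources in random order) could look roughly like this. This is a Python sketch for illustration only; `order_sources`, its `strategy` values, and the `rng` parameter are made-up names, not part of any GAP package.

```python
import random


def order_sources(primary, mirrors, strategy="fallback", rng=None):
    """Return the URLs in the order they should be tried (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    urls = [primary, *mirrors]
    if strategy == "fallback":
        # Always hit the primary site first; mirrors are used only on failure.
        return urls
    if strategy == "random":
        # Spread load across all sources by shuffling the whole list.
        rng.shuffle(urls)
        return urls
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Random order spreads load and avoids always waiting for a timeout on a dead primary, while the fallback order keeps the primary site authoritative whenever it is reachable.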
This is all also related to issue #4285 ...