atlasrep website down results in GAP hanging

Open dimpase opened this issue 8 months ago • 9 comments

The CI issues are (were) caused by the atlasrep web site https://www.math.rwth-aachen.de/homes/Thomas.Breuer/atlasrep/ being down, which makes gap hang when trying to remotely fetch the atlas data. The website is back up now, but I find it quite shocking that a remote website being down can break running a local instance of GAP.

Ah, thanks. This explains why the error occurred without any changes on our side. Not sure where to report the underlying issue. @dimpase any idea?

Originally posted by @tobiasdiez in https://github.com/sagemath/sage/pull/39993#issuecomment-2823079716

dimpase avatar Apr 23 '25 14:04 dimpase

Atlasrep should throw an error, not just hang.

dimpase avatar Apr 23 '25 14:04 dimpase

Is it possible that this affects CI tests for packages?

Recently, some of my CI tests occasionally take ~15 mins instead of the usual ~1 min, even though the times given by TestDirectory did not increase. And this increase in time happens when GAP is started normally (which loads AtlasRep), but does not happen when it is started with the -A flag (which doesn't load AtlasRep).

See e.g. these affected and unaffected runs.

stertooy avatar Apr 23 '25 15:04 stertooy

Thanks for the report.

It seems the website in Aachen was offline a couple of days ago which caused this? (Also the GAPDoc homepage was not available during that time.) Of course that can happen, but I agree this should not cause things to stall for so long.

Perhaps it is a matter of setting a timeout value somewhere?

CC @ThomasBreuer

fingolfin avatar Apr 28 '25 08:04 fingolfin

Thomas and I discussed this; here are some things we plan to do to improve the situation:

On the technical side:

  • the Download function in the Utils package should be enhanced with an (optional) timeout argument
  • other Download implementations (in curlInterface and JuliaInterface) need to implement support for this
  • this then needs to be released
  • atlasrep (and other users of Download) should possibly set a timeout
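To illustrate the idea behind the first bullet, here is a minimal sketch in Python (not GAP, and not the actual Utils API). The function name `download`, the `opener` parameter, the 10-second default, and the shape of the returned record are all assumptions made for this example. The point is only that a caller gets an error back instead of blocking indefinitely.

```python
import urllib.request


def download(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch `url`, giving up instead of hanging when the server is unresponsive.

    Hypothetical helper: returns a dict with `success` and either
    `result` (bytes) or `error` (str).
    """
    try:
        # `timeout` bounds each blocking socket operation (connect, read),
        # not the total transfer time.
        with opener(url, timeout=timeout) as response:
            return {"success": True, "result": response.read()}
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return {"success": False, "error": f"{type(exc).__name__}: {exc}"}
```

A caller would then check `success` and report `error` rather than wait on a dead host. Note that this timeout applies per socket operation, so a very slow but still responding server could exceed it in total.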

In addition, it would be good to have an active mirror of the atlasrep data. We can host it on e.g. gap-system.org (in Kaiserslautern). Once we have that, atlasrep can be modified to try the mirror if it fails to reach the primary site.

fingolfin avatar Apr 28 '25 08:04 fingolfin

This seems to be happening again as of yesterday. My CI tests with AtlasRep loaded took ~100 minutes to finish; without AtlasRep they took <1 min. And indeed it seems that the AtlasRep and GAPDoc websites are down.

stertooy avatar Sep 02 '25 07:09 stertooy

Indeed, there was a network problem with our webserver www.math.rwth-aachen.de (which is fixed now). But I don't understand why this is a problem for the CI tests. This webserver is only needed when a new version of the AtlasRep, GAPDoc, EDIM, ... packages needs to be fetched. As far as I can see, running AtlasRep was not affected by this problem. The Atlas data are on a different server, atlas.math.rwth-aachen.de, which was running without problems.

frankluebeck avatar Sep 02 '25 08:09 frankluebeck

I suspect it may have been caused by the MFER and CTBlocks files then (lines 225 - 261 in userpref.g), which are hosted on math.rwth-aachen.de. If I remove my local copies of those files and change the URLs to nonsense in userpref.g, my CI tests give an error in the same place they would hang.

stertooy avatar Sep 02 '25 09:09 stertooy

Thanks for this hint, I was not aware of these data from the www.math.rwth-aachen.de server.

frankluebeck avatar Sep 02 '25 09:09 frankluebeck

It would be really good to have atlasrep more resilient to such outages (which can also happen even if the RWTH servers are up and running, due to network errors "in between" those servers and the user's host).

We really should start making use of timeouts. Getting https://github.com/gap-packages/utils/pull/81 finished and merged would go a long way towards this.

We could also add mirrors for these data files and try those mirrors (either as a fallback; or possibly in random order; or something else)

  • I have in mind mirroring on GitHub, on our own gap-system.org server in Kaiserslautern, or possibly a CDN like cloudflare...
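The fallback and random-order variants mentioned above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names `fetch` and `fetch_from_mirrors` are made up here, the URLs in the usage note are placeholders, and none of this reflects how atlasrep actually fetches its data.

```python
import random
import urllib.request


def fetch(url, timeout=10.0):
    """Return the body of `url` as bytes; raises OSError on network failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_from_mirrors(urls, fetch=fetch, shuffle=False, rng=random):
    """Try each URL in turn and return (url, data) for the first that works.

    Hypothetical helper: with shuffle=False the first URL acts as the
    primary site and the rest as fallbacks; with shuffle=True the order
    is randomized to spread load across mirrors.
    """
    candidates = list(urls)
    if shuffle:
        rng.shuffle(candidates)
    errors = []
    for url in candidates:
        try:
            return url, fetch(url)
        except OSError as exc:
            # Remember why this mirror failed, then move on to the next one.
            errors.append(f"{url}: {exc}")
    raise OSError("all mirrors failed: " + "; ".join(errors))
```

Usage would be something like `fetch_from_mirrors(["https://primary.example.org/data.json", "https://mirror.example.org/data.json"])`, where both URLs are placeholders. Combined with a per-request timeout, an outage of the primary site then costs at most one timeout period before the mirror is tried.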

This is all also related to issue #4285 ...

fingolfin avatar Sep 04 '25 08:09 fingolfin