setup-miniconda

Retry on CondaHTTPError

Open jaimergp opened this issue 4 years ago • 12 comments

Sometimes the connection to the Anaconda server fails and a simple retry fixes it:

Example

  CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/bioconda/noarch/current_repodata.json>
  Elapsed: -
  
  An HTTP error occurred when trying to retrieve this URL.
  HTTP errors are often intermittent, and a simple retry will get you on your way.
  'https://conda.anaconda.org/bioconda/noarch'

If you think implementing this logic in the action (like three attempts, 10 seconds apart) is a good idea, I'll be happy to implement it once the refactor is done!
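A minimal sketch of that retry logic, written as a shell helper that a workflow step could wrap around a conda command rather than as code inside the action itself. The function name `retry` and its argument order are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not part of setup-miniconda):
#   retry <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached.
# The body runs in a subshell, so its variables do not leak into the caller.
retry() (
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Command failed after $n attempts: $*" >&2
      exit 1  # leaves only the subshell; retry returns status 1
    fi
    echo "Attempt $n failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    n=$((n + 1))
  done
)
```

With the numbers suggested above, a step could call `retry 3 10 conda env update --file environment.yml`, where `environment.yml` stands in for whatever the job actually installs. This retries on any non-zero exit status, not only on `CondaHTTPError`, so it avoids parsing conda's output at the cost of also retrying genuine failures.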

jaimergp avatar Dec 23 '20 13:12 jaimergp

Are there better config defaults we could set that would allow conda to do this for us? I know there are a bunch of http settings... would certainly behoove us to at least raise the cache time for repodata. I really don't like parsing random string outputs, e.g. what does mamba throw in a similar situation?

bollwyvl avatar Dec 24 '20 19:12 bollwyvl

Some of these, from https://docs.conda.io/projects/conda/en/latest/configuration.html:

# # remote_connect_timeout_secs (float)
# #   The number seconds conda will wait for your client to establish a
# #   connection to a remote url resource.
# # 
# remote_connect_timeout_secs: 9.15

# # remote_max_retries (int)
# #   The maximum number of retries each HTTP connection should attempt.
# # 
# remote_max_retries: 3

# # remote_backoff_factor (int)
# #   The factor determines the time HTTP connection should wait for
# #   attempt.
# # 
# remote_backoff_factor: 1

# # remote_read_timeout_secs (float)
# #   Once conda has connected to a remote resource and sent an HTTP
# #   request, the read timeout is the number of seconds conda will wait for
# #   the server to send a response.
# # 
# remote_read_timeout_secs: 60.0
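As a rough illustration of overriding these defaults, the sketch below writes a small condarc file with more patient values. The function name `write_ci_condarc` and the specific numbers are assumptions made for the example, not tested recommendations; only the four setting names come from the conda configuration listed above.

```shell
# Hypothetical helper: write a condarc with more tolerant network settings.
# The values are illustrative assumptions, not tuned recommendations.
write_ci_condarc() {
  cat > "$1" <<'EOF'
remote_connect_timeout_secs: 30.0
remote_max_retries: 5
remote_backoff_factor: 2
remote_read_timeout_secs: 120.0
EOF
}
```

Calling `write_ci_condarc "$HOME/.condarc"` would place the file where conda looks by default; conda also reads the file named by the `CONDARC` environment variable. How this interacts with the configuration the action writes itself is not covered here.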

goanpeca avatar Dec 24 '20 19:12 goanpeca

Oh, I wasn't aware of those settings. Yep, providing other defaults that suit this use case better seems like a better idea!

jaimergp avatar Dec 25 '20 15:12 jaimergp

Hey guys, happy new year! We have started seeing CondaHTTPError failures more and more over the past few days: nightly tests that used to complete with no issues are now dying like flies because of HTTP 500 errors and similar remote server issues (see example). Do you guys know what's going on? Is there anything you can recommend to avert this? It seems clear that it's a conda issue (SSL setup, maybe?). Cheers muchly in advance, and keep up the great work @goanpeca et al :beer:

valeriupredoi avatar Jan 07 '21 12:01 valeriupredoi

@mattwthompson is also reporting issues like these at openforcefield, I believe!

jaimergp avatar Jan 07 '21 12:01 jaimergp

We are running the tests nightly at 00:00 UTC, but I have noticed that tests run at times other than midnight UTC pass more frequently. Maybe Anaconda is suffering from connection issues around 00:00 UTC? Someone playing Cyberpunk at night at Anaconda and hogging the bandwidth? :rofl:

valeriupredoi avatar Jan 07 '21 12:01 valeriupredoi

Yeah, this is your free-as-in-beer resource having capacity issues. Us adding more retry would likely exacerbate the issue.

Really, we need to figure out how to better enable good caching, potentially making use thereof a hard requirement in v3.

bollwyvl avatar Jan 07 '21 12:01 bollwyvl

I'd much rather hammer the resources a little harder and feel guilty about doing so than have to manually restart jobs on a regular basis - sometimes our cron jobs almost all fail because of HTTP errors when pulling from our non-defaults/conda-forge channel. If I'm understanding the docs correctly, @jaimergp's suggestion of retrying a few times after a few seconds seems to already be the default?

$ conda config --show | grep remote
remote_backoff_factor: 1
remote_connect_timeout_secs: 9.15
remote_max_retries: 3
remote_read_timeout_secs: 60.0

I think accessing these settings from the action would provide a better solution than, e.g., running a conda config step outside the action or trying to run cron jobs at particular times of the day.
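Until such options are exposed, one possible workaround is sketched below. It relies on two assumptions: that conda picks up a setting from an upper-case `CONDA_`-prefixed environment variable, and that lines appended to the file named by `$GITHUB_ENV` become environment variables in later steps of the same job. The helper name `persist_conda_network_env` and the values are hypothetical.

```shell
# Hypothetical helper: append conda network settings to a GitHub Actions
# environment file so that later steps in the job inherit them.
# Assumption: conda maps CONDA_<SETTING_NAME> variables onto its settings.
persist_conda_network_env() {
  env_file="${1:-$GITHUB_ENV}"
  {
    echo "CONDA_REMOTE_MAX_RETRIES=5"
    echo "CONDA_REMOTE_BACKOFF_FACTOR=2"
    echo "CONDA_REMOTE_CONNECT_TIMEOUT_SECS=30.0"
    echo "CONDA_REMOTE_READ_TIMEOUT_SECS=120.0"
  } >> "$env_file"
}
```

A step placed before the setup-miniconda step would call `persist_conda_network_env` with no arguments. It is worth confirming with `conda config --show | grep remote` in a later step that the values actually took effect.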

mattwthompson avatar Jan 07 '21 15:01 mattwthompson

hey guys, changing the time the scheduled test is running at did the trick for us (at least for the first time it ran at 4am UTC instead of midnight UTC last night) - might actually be something going on on the Anaconda servers :+1:

valeriupredoi avatar Jan 12 '21 10:01 valeriupredoi

> Us adding more retry would likely exacerbate the issue.

@bollwyvl how do you figure? There's no way to re-run a single failed job in a GitHub action workflow, so one failed job means re-running the entire workflow with all 10 or 15 or however many jobs (with all their installs) until things pass. That seems worse (I'd happily just re-run single failed jobs manually if GH allowed)

bryevdv avatar May 24 '21 22:05 bryevdv

> There's no way to re-run a single failed job in a GitHub action workflow, so one failed job means re-running the entire workflow with all 10 or 15 or however many jobs (with all their installs) until things pass. That seems worse (I'd happily just re-run single failed jobs manually if GH allowed)

I think this is no longer the case @bryevdv, so the situation has improved a bit with that inclusion.


@conda-incubator/setup-miniconda do you think we should expose these options? @bollwyvl, @jaimergp ?

goanpeca avatar May 26 '22 19:05 goanpeca

Yes, on GitHub you can now re-run only the failed jobs, rather than having to restart every job. So this is a significantly less impactful problem now, in the context of GitHub Actions.

bryevdv avatar May 27 '22 00:05 bryevdv