aiida-core icon indicating copy to clipboard operation
aiida-core copied to clipboard

Decrease `_DEFAULT_SAFE_OPEN_INTERVAL` (to what?)

Open GeigerJ2 opened this issue 1 year ago • 10 comments

A possible quick fix to not spend more of our time (at least for now...) on:

  • #6544
  • #6595
  • #6596

While this is quite a minor change, as it has rather far-reaching consequences, we should bikeshed this at least for a week or so :D Any thoughts on the duration, @sphuber, @mbercx, @unkcpz, @t-reents? @giovannipizzi proposed 3s, but I'm worried this is a bit short, so I put 5s for now. Will add some benchmarks here later on.

GeigerJ2 avatar Nov 04 '24 10:11 GeigerJ2

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 77.61%. Comparing base (9256f2f) to head (1020c84). :warning: Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6599      +/-   ##
==========================================
+ Coverage   77.59%   77.61%   +0.02%     
==========================================
  Files         566      566              
  Lines       43514    43547      +33     
==========================================
+ Hits        33760    33793      +33     
  Misses       9754     9754              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 04 '24 10:11 codecov[bot]

Why change this default? If I understand correctly, the main problem is with the SSH plugin, right? The LocalTransport already changes this default as well, since it doesn't really need one. So wouldn't it make more sense to limit the scope and change the default on SshTransport. That being said, the whole point is that this is a safe default, it can be configured by a user to something lower. So why not just keep that? They can simply run verdi computer configure core.ssh -n --safe-interval 5.

sphuber avatar Nov 04 '24 20:11 sphuber

Yes, good point, it should be changed in the SshTransport instead.

I think most users, especially new ones, will just be running with the default value. Especially if they go through verdi computer setup and then get

Report: Note: before the computer can be used, it has to be configured with the command:
Report:   verdi -p <profile> computer configure core.ssh <computer>

they'll likely just copy-paste this command. Instead, if they run verdi computer configure core.ssh --help they will probably be overwhelmed with the amount of options and overlook it (or not even read through them at all, due to laziness). Then, running short test jobs like the ArithmeticAdd will take longer than they expect, and they might think they did something wrong during their setup.

The question is, is 30s really necessary? And if it's overkill, why not reduce it.

GeigerJ2 avatar Nov 05 '24 07:11 GeigerJ2

The question is, is 30s really necessary? And if it's overkill, why not reduce it.

It is put at that value to prevent users being banned from compute centers for opening too many SSH connections. Admittedly, this is more likely to happen when you are running high-throughput with multiple daemon workers, and I am not sure if that is the main use case anymore for AiiDA users. So we can lower it, with the risk that some users may get warnings or banned. If @giovannipizzi also signed off on changing the default to favoring casual users over protecting heavy high-throughput users by default, then it is fine by me as well.

sphuber avatar Nov 05 '24 08:11 sphuber

In practice, as also the tests by @khsrali (i think in a different issue?) show, even with a very low value (like 0.1s, that I would not recommend) the number of open connections remains every low. This is because while a connection is not being opened, other tasks pile up and then reuse the same connection (@khsrali can you point to your tests?). For me 5s (or maybe even 3s) is OK, in any case a supercomputer center should not ban you if you connect every 5 seconds every now and then (e.g I could do it myself opening a few ssh shells). If I'm running mid throughput, requests will pile up and reuse the connection. If I'm really running high thoughput, probably I'll need to tweak a few options anyway. So I'd probably just check if 3s is too much once more and then change the default, and update the doc page on how to run high throughput, mentioning what this value is and that it might need to be modified to avoid being banned

giovannipizzi avatar Nov 05 '24 08:11 giovannipizzi

The benchmarks by @khsrali can be found here: https://github.com/aiidateam/aiida-core/issues/6544#issuecomment-2407529994

GeigerJ2 avatar Nov 05 '24 09:11 GeigerJ2

@GeigerJ2 , do you have a moment to apply the changes? Thanks a lot!

khsrali avatar Dec 18 '24 13:12 khsrali

Hi @khsrali, thanks for the ping here! Before we actually merge this change, I'd like to do some proper benchmarking (e.g., with Thor). That's why this has been on hold for a while now. Maybe we can do it together after the holidays? I remember you already did some benchmarks in the past 😉

GeigerJ2 avatar Dec 19 '24 16:12 GeigerJ2

@GeigerJ2 Is doing a benchmark is really necessarily for this? I suggest Just merge it, 5 second is long enough :smiling_face_with_tear:

khsrali avatar Jan 30 '25 10:01 khsrali

:smiling_face_with_tear: 🤌

khsrali avatar Apr 23 '25 16:04 khsrali