[9.0] feat(Resources): introduce fabric in SSHCE

Open aldbr opened this issue 1 year ago • 8 comments

Replace the DIRAC-specific SSH class with fabric.

BEGINRELEASENOTES
*Resources
CHANGE: Replace SSH by fabric in SSHComputingElement
ENDRELEASENOTES

aldbr, Jun 27 '24 07:06

Just a note: we do not (yet) have a way to run proper integration tests for the Computing Elements, but we could think about adding them to our integration test setup. It would be nice if that were part of this PR. It would involve creating a "site" with a "CE" (this would be yet another container), to which the SiteDirector could send pilots.

fstagni, Jun 28 '24 15:06

> Just a note: we do not (yet) have a way to run proper integration tests for the Computing Elements, but we could think about adding them to our integration test setup. It would be nice if that were part of this PR. It would involve creating a "site" with a "CE" (this would be yet another container), to which the SiteDirector could send pilots.

I agree it would be great to add integration tests for CEs, at least to test basic features. But it will likely become complex because:

  • if we want to test things properly, we need to set up a CE and a Batch System.
  • we will have to choose one configuration, but it might not reflect the configuration of the sites in production.

I will give it a try with the SSHCE, let's see.

aldbr, Jul 25 '24 15:07

I wonder if it really makes sense to add CEs (and Batch Systems) to the integration tests: while it would be great to have a "grid in a box" in a controlled environment, it would be cumbersome to maintain in the long term and would not be representative of all the instances we can find out there (e.g. Arc v6, v6 with a hack, v7, transferring jobs to Slurm, HTCondor, SSH, SSH tunnel, HTCondor with local scheduler, with remote scheduler...).

It would probably make more sense to add some scripts to run during the hackathons. For each supported type of CE, such a script would:

  • get all the instances related to the given type of CE and for each of them:
    • submit a "hello world" job
    • get the CE status
    • get the job status until it reaches a final state
    • get the job output and logging info (if available)
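The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `ComputingElementLike`, its method names, the `FINAL_STATES` set and the in-memory `FakeCE` are hypothetical stand-ins and not the actual DIRAC CE interfaces. A real script would plug in the real CE objects and their return conventions.

```python
import time
from typing import Dict, List, Protocol, Tuple

# Assumed set of final job states; the real set depends on the CE type.
FINAL_STATES = {"Done", "Failed", "Aborted"}


class ComputingElementLike(Protocol):
    """Hypothetical minimal CE interface used by this sketch only."""

    def submit_job(self, script: str) -> str: ...
    def get_ce_status(self) -> dict: ...
    def get_job_status(self, job_id: str) -> str: ...
    def get_job_output(self, job_id: str) -> Tuple[str, str]: ...


def exercise_ce(ce: ComputingElementLike, poll_interval: float = 0.0,
                max_polls: int = 10) -> dict:
    """Run the four steps against one CE instance and return a small report."""
    report: dict = {}
    # 1. Submit a "hello world" job.
    job_id = ce.submit_job("echo 'hello world'")
    report["job_id"] = job_id
    # 2. Get the CE status.
    report["ce_status"] = ce.get_ce_status()
    # 3. Poll the job status until it reaches a final state (or we give up).
    status = ce.get_job_status(job_id)
    polls = 1
    while status not in FINAL_STATES and polls < max_polls:
        time.sleep(poll_interval)
        status = ce.get_job_status(job_id)
        polls += 1
    report["job_status"] = status
    report["polls"] = polls
    # 4. Get the job output, if the job finished and the CE provides it.
    report["output"] = None
    if status in FINAL_STATES:
        try:
            report["output"] = ce.get_job_output(job_id)
        except NotImplementedError:
            pass  # Some CE types may not expose outputs.
    return report


def exercise_all(ces_by_type: Dict[str, List[ComputingElementLike]]) -> dict:
    """Apply exercise_ce to every instance of every CE type."""
    return {ce_type: [exercise_ce(ce) for ce in instances]
            for ce_type, instances in ces_by_type.items()}


class FakeCE:
    """In-memory stand-in so the sketch runs without any real CE."""

    def __init__(self) -> None:
        self._polls = 0

    def submit_job(self, script: str) -> str:
        return "job-1"

    def get_ce_status(self) -> dict:
        return {"running": 0, "waiting": 1}

    def get_job_status(self, job_id: str) -> str:
        # Pretend the job finishes on the third status query.
        self._polls += 1
        return "Done" if self._polls >= 3 else "Running"

    def get_job_output(self, job_id: str) -> Tuple[str, str]:
        return ("hello world\n", "")
```

For example, `exercise_all({"SSH": [FakeCE()]})` returns one report per instance. A human would still have to read the reports to decide whether a failure comes from the CE instance or from the CE interface.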

Basically, it would be very similar to (i) submitting pilots with the Site Director and (ii) checking their results manually. But it would be more focused on the CE interfaces and more automated (though a human would still need to check whether errors come from the CE instance itself or from the DIRAC CE interface).

Any opinion, @fstagni?

aldbr, Nov 29 '24 10:11

I think the only one that would make sense to set up here is the SSHCE. The others, the "proper Grid ones", cannot be tested here.

fstagni, Nov 29 '24 10:11

I don't even know if testing SSHCE in an integration test makes sense. The only easy test we can set up would be SSHCE + Host, which is not representative of what we can have in production.

aldbr, Nov 29 '24 11:11

OK OK, give up on the idea...

fstagni, Nov 29 '24 12:11

I will add a certification test focused on the CE interfaces, as I explained (plus a card in the kanban board to explain how to execute it). I will execute it in the LHCb environment to make sure the changes in this PR are correct.

And I can also try to add a container that would act as a "Site" and use SSH + Host so that we can at least test the Site Director "in a box". Would it be okay?

aldbr, Nov 29 '24 13:11

Sure, thanks

fstagni, Nov 29 '24 13:11