[WIP] Migrate htcondor runner to use the python htcondor bindings

Open bgruening opened this issue 4 years ago • 7 comments

Advantages:

  • using official bindings and an event logging mechanism if needed
  • no hand-crafted parsing of condor job logs
  • less I/O (I hope)
  • avoid subprocess usage

Disadvantages:

  • potentially many open file handles (one per running job)
  • the wheel is 30MB

The general question is whether event logging, polling, or a mixture of the two is the best approach for us. More information can be found here: https://htcondor.readthedocs.io/en/latest/apis/python-bindings/tutorials/Scalable-Job-Tracking.html

If a job runs longer than a day, we poll once, to be safe.
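
The mixed approach could be sketched roughly as follows. This is only an illustration of the bookkeeping, not code from the runner: `TrackedJob`, `record_event` and `jobs_needing_poll` are hypothetical names, and it assumes "poll once after a day" means "poll any job we have not heard about for a day". No htcondor calls are made here.

```python
from dataclasses import dataclass

ONE_DAY = 24 * 60 * 60  # seconds


@dataclass
class TrackedJob:
    """Hypothetical record for one running job."""
    job_id: str
    last_checked: float  # timestamp of the last event or poll we saw


def record_event(job, now):
    """Call this whenever an event (or a poll result) arrives for the job."""
    job.last_checked = now


def jobs_needing_poll(jobs, now, max_silence=ONE_DAY):
    """Return the jobs that have been silent for longer than max_silence."""
    return [job for job in jobs if now - job.last_checked > max_silence]
```

Events would stay the primary mechanism, and only the jobs returned by `jobs_needing_poll` would be queried explicitly, which should keep the polling load small.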

I appreciate any comments, as I'm not even sure the python bindings are any better than the current state.

bgruening avatar Dec 26 '20 20:12 bgruening

@natefoo @jmchilton any comments are more than welcome :)

bgruening avatar Dec 26 '20 20:12 bgruening

Wheel size could be addressed a bit by conditionally loading the dependency - like Kubernetes does I believe.
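
A minimal sketch of what conditional loading could look like, assuming the runner only needs the package when it is actually configured. `try_import` and `require_module` are hypothetical helper names, not existing Galaxy functions:

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_module(module_name):
    """Import a module or fail with a clear message at runner start-up."""
    module = try_import(module_name)
    if module is None:
        raise RuntimeError(
            f"The '{module_name}' package is required for this runner "
            "but is not installed."
        )
    return module
```

A runner could then call `require_module("htcondor")` in its constructor, so installs that never enable it would not need the wheel at all.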

If there are open questions and you're not sure this will be the right approach - why not just write a second runner? It seems like a second condor runner would let us test this more without any chance of breaking things (existing condor installs, pulsar tests, etc..).

jmchilton avatar Dec 31 '20 04:12 jmchilton

What about a version mismatch between the wheel and the condor service? In the past you needed a python-htcondor that matched your htcondor version specifically. Is that still an issue?

hexylena avatar Jan 04 '21 09:01 hexylena

I'll push this to 21.05, but feel free to merge if you're happy. I do like John's suggestion of a new runner, if you're not certain yet.

mvdbeek avatar Jan 07 '21 16:01 mvdbeek

A conditional dependency for this should be quite simple, see the one for pbs_python. Also, like this, pbs_python is linked to the specific version of Torque, so for this you can simply let the wheel be built from source on install.

natefoo avatar Jan 29 '21 21:01 natefoo

> no hand-crafted parsing of condor job logs

:heart_eyes:

> Also, like this, pbs_python is linked to the specific version of Torque, so for this you can simply let the wheel be built from source on install.

Oh, that's fantastic. For a long time I'd avoided the python bindings in favor of just parsing the command output, because it was sometimes annoying to have to get a very specific version. This alleviates that concern.

:+1: if it's a new runner and we can have some time to test both? I'm using Condor in my new job, happy to test it out there.

hexylena avatar Feb 01 '21 08:02 hexylena

Pushing the milestone - it seems like there is a plan, we just need someone to implement it. If it gets done before the freeze, we can just readjust the milestone.

jmchilton avatar Sep 08 '21 12:09 jmchilton