[WIP] Migrate htcondor runner to use the python htcondor bindings
Advantages:
- using official bindings and an event logging mechanism if needed
- no hand-crafted parsing of condor job logs
- less I/O (I hope)
- avoid subprocess usage
Disadvantages:
- potentially many open file handles (one per running job)
- the wheel is 30MB
The general question is whether event logging, polling, or a mixture of the two is the best approach for us. More information can be found here: https://htcondor.readthedocs.io/en/latest/apis/python-bindings/tutorials/Scalable-Job-Tracking.html
If a job has been running for longer than a day, we poll once, to be safe.
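To make the mixed approach concrete, here is a rough sketch of the "poll once after a day" decision. The names `TrackedJob`, `needs_safety_poll`, and `POLL_AFTER_SECONDS` are made up for illustration and are not taken from this PR; the actual event reading and schedd query are left out on purpose.

```python
import time
from dataclasses import dataclass

# Assumption: "longer than a day" means 24 hours since the job started.
POLL_AFTER_SECONDS = 24 * 60 * 60


@dataclass
class TrackedJob:
    """Hypothetical bookkeeping record for one watched condor job."""
    job_id: str
    started_at: float  # epoch seconds when the job started running
    polled: bool = False  # set to True once the safety poll has been done


def needs_safety_poll(job, now=None):
    """Return True if the job should be polled once as a safety net.

    Event logging stays the primary mechanism; this only decides whether
    the single fallback poll is due.
    """
    now = time.time() if now is None else now
    return not job.polled and (now - job.started_at) > POLL_AFTER_SECONDS
```

The caller would run the actual poll when this returns `True` and then set `job.polled = True`, so each long-running job is polled at most once.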
I appreciate any comments, as I'm not even sure the python bindings are any better than the current state.
@natefoo @jmchilton any comments are more than welcome :)
Wheel size could be addressed a bit by conditionally loading the dependency - like Kubernetes does I believe.
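Roughly what I have in mind for loading the dependency only when needed, as a sketch. `load_optional` and `require_htcondor` are hypothetical helper names, not existing Galaxy functions; the only thing assumed about the bindings is that they are importable as `htcondor`.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_htcondor():
    """Import the bindings only when the runner is actually used.

    Hypothetical helper: raises a clear error instead of failing at
    module import time for installs that never configure this runner.
    """
    htcondor = load_optional("htcondor")
    if htcondor is None:
        raise RuntimeError(
            "The htcondor Python bindings are required by this runner "
            "but are not installed."
        )
    return htcondor
```

With something like this, installs that do not use the runner never need to download the 30MB wheel.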
If there are open questions and you're not sure this is the right approach, why not just write a second runner? A second condor runner would let us test this more without any chance of breaking things (existing condor installs, pulsar tests, etc.).
What about a version mismatch between the wheel and the condor service? In the past you needed a python-htcondor that matched your htcondor version specifically. Is that still an issue?
I'll push this to 21.05, but feel free to merge if you're happy. I do like John's suggestion of a new runner, if you're not certain yet.
A conditional dependency for this should be quite simple, see the one for pbs_python. Also, like this, pbs_python is linked to the specific version of Torque, so for this you can simply let the wheel be built from source on install.
> no hand-crafted parsing of condor job logs
:heart_eyes:
> Also, like this, pbs_python is linked to the specific version of Torque, so for this you can simply let the wheel be built from source on install.
Oh, that's fantastic. For so long I'd avoided the python bindings in favor of just parsing the command output, simply because it was sometimes annoying to have to get a very specific version. This alleviates that concern.
:+1: if it's a new runner and we can have some time to test both. I'm using Condor in my new job and am happy to test it out there.
Pushing the milestone. It seems like there is a plan here; we just need someone to implement it. If it gets done before the freeze, we can just readjust the milestone.