pycbc
Add progress reporting via condor_chirp
@bbockelm clued me in to condor_chirp, which CMS uses to send status messages back to the schedd, where they can be queried by the user.
It would be good to add chirp status messages to the long-running codes so the user can see whether they are stuck or making progress without ssh-ing into the node where they are running and poking around with ps.
The invocation we want to use is `condor_chirp set_job_attr_delayed JobAttributeName AttributeValue`, since it is non-blocking. Unfortunately, there's no Python API for chirp, so we'll need to run it via system calls.
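A minimal sketch of the system-call approach follows. It assumes `condor_chirp` can be found on the job's `PATH` (in many installs it lives in HTCondor's libexec directory, so the path may need adjusting) and that the job was submitted with chirp access enabled (e.g. `+WantIOProxy = true`). The function names are hypothetical. Because `set_job_attr_delayed` takes a ClassAd expression, string values must carry their own double quotes.

```python
import shutil
import subprocess


def chirp_command(attr_name, classad_value, chirp="condor_chirp"):
    """Build the argument list for a non-blocking condor_chirp update.

    classad_value must already be a valid ClassAd expression, e.g. "42"
    or '"running"' (note the embedded double quotes for strings).
    """
    return [chirp, "set_job_attr_delayed", attr_name, classad_value]


def chirp_set_delayed(attr_name, classad_value, chirp="condor_chirp"):
    """Send the update; return True on success, False otherwise.

    Failures are swallowed so that progress reporting can never crash
    the analysis (e.g. when the code is run outside HTCondor).
    """
    if shutil.which(chirp) is None:
        return False  # condor_chirp not available: silently skip
    try:
        result = subprocess.run(
            chirp_command(attr_name, classad_value, chirp),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For example, `chirp_set_delayed("ChirpPycbcStatus", '"computing PSD"')`, where `ChirpPycbcStatus` is a hypothetical attribute name. Passing an argument list rather than a shell string sidesteps shell quoting, leaving only ClassAd quoting to worry about.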
This should be done for:
- [ ] pycbc_inspiral
- [ ] pycbc_optimal_snr
- [ ] pycbc_compute_psd
In particular, if you are invoking from C++, here's the interface we use:
https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/plugins/CondorStatusUpdater.cc#L308
Note that, to reduce the stream of updates, we send at most one update every N minutes.
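Rate limiting of this kind could be sketched in Python as a small wrapper that drops any update arriving sooner than a minimum interval after the previous one. This is not a port of CMS's `CondorStatusUpdater`; the class name and the five-minute default are hypothetical choices for illustration.

```python
import time


class ThrottledReporter:
    """Forward status updates at most once every `min_interval` seconds.

    `send` is any callable taking (attr_name, value); in a real job it
    would wrap a condor_chirp call. `clock` is injectable for testing.
    """

    def __init__(self, send, min_interval=300.0, clock=time.monotonic):
        self.send = send
        self.min_interval = min_interval
        self.clock = clock
        self._last_sent = None  # clock value of the last forwarded update

    def update(self, attr_name, value, force=False):
        """Send the update if enough time has passed; return True if sent."""
        now = self.clock()
        if (not force and self._last_sent is not None
                and now - self._last_sent < self.min_interval):
            return False  # too soon after the previous update: drop it
        self.send(attr_name, value)
        self._last_sent = now
        return True
```

Because throttled updates are dropped rather than queued, a final `update(..., force=True)` at the end of the job makes sure the last state is always reported.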
There's a bit of trickiness around getting the quoting correct. If you are actually invoking from Python, you might consider using the HTCondor Python bindings to properly escape the strings.
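One way to handle this is to prefer `classad.quote()` from the HTCondor Python bindings when the `classad` module is importable, and fall back to a hand-rolled escape otherwise. The fallback only escapes backslashes and double quotes, which we assume is enough for plain ASCII status messages; it is not a full implementation of the ClassAd string-literal rules, and the function name is hypothetical.

```python
def classad_string(text):
    """Return `text` as a ClassAd string literal, e.g. abc -> "abc"."""
    try:
        import classad  # HTCondor Python bindings, if installed

        return classad.quote(text)
    except ImportError:
        # Minimal fallback: escape backslashes first, then double quotes,
        # and wrap the result in double quotes.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
```

The result can be passed directly as the `AttributeValue` argument to `condor_chirp set_job_attr_delayed`.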
CMS's updates basically include the following:
- A timestamp, so we can determine upstream when progress was last made.
- I/O layer instrumentation (files processed, bytes read, bytes written, number of operations, total time taken).
- Physics instrumentation (events / lumis / runs processed).
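An analogous set of attributes for a PyCBC job might be assembled as below. The attribute names (`ChirpPycbcLastUpdate` and so on) and the choice of counters are hypothetical, loosely mirroring the three CMS categories. Values are rendered as ClassAd expressions: numbers go through unquoted and the stage name is quoted as a string.

```python
import time


def progress_attributes(segments_done, segments_total, bytes_read, stage,
                        now=None):
    """Map progress counters to ClassAd-valued Chirp attributes.

    Returns a dict of attribute name -> ClassAd expression string.
    """
    if now is None:
        now = time.time()
    # Quote the stage name as a ClassAd string literal.
    stage_literal = '"' + stage.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return {
        # Timestamp (Unix epoch seconds) of the last observed progress.
        "ChirpPycbcLastUpdate": str(int(now)),
        # I/O-style counter.
        "ChirpPycbcBytesRead": str(int(bytes_read)),
        # Analysis-level counters (analogous to CMS events/lumis/runs).
        "ChirpPycbcSegmentsDone": str(int(segments_done)),
        "ChirpPycbcSegmentsTotal": str(int(segments_total)),
        # Human-readable name of the current stage.
        "ChirpPycbcStage": stage_literal,
    }
```

Each entry would then be sent with `condor_chirp set_job_attr_delayed`, after which the user could inspect progress with something like `condor_q <job id> -af ChirpPycbcSegmentsDone ChirpPycbcLastUpdate`.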
See also https://jira.isi.edu/browse/PM-1357
@duncan-brown Is this now done, or should it remain open?