
Add progress reporting via condor_chirp

Open duncan-brown opened this issue 8 years ago • 3 comments

@bbockelm clued me in to condor_chirp, which CMS uses to send status messages back to the schedd, where the user can query them.

It would be good to add chirp status messages to the codes that take a while to run, so the user can see whether they are stuck or making progress without ssh-ing into the node where they are running and poking around with ps.

The invocation we want to use is set_job_attr_delayed JobAttributeName AttributeValue as that is non-blocking. Unfortunately, there's no Python API for chirp, so we'll need to run it via system calls.
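A minimal sketch of what that system call might look like from Python, using subprocess with an argument list so no shell is involved. The attribute name used in the usage note (`ChirpPyCBCTemplatesDone`) and both function names are hypothetical. Two assumptions worth checking against the local HTCondor setup: condor_chirp usually lives in the HTCondor libexec directory rather than on PATH, and delayed updates are, as far as I know, only accepted by default for attribute names that begin with `Chirp`.

```python
import shutil
import subprocess


def chirp_command(attr_name, attr_expr, chirp_exe="condor_chirp"):
    """Build the argument list for a delayed job-attribute update.

    attr_expr must already be a valid ClassAd expression (e.g. an integer).
    """
    return [chirp_exe, "set_job_attr_delayed", attr_name, str(attr_expr)]


def report_progress(attr_name, attr_expr, chirp_exe="condor_chirp"):
    """Send the update if condor_chirp can be found; never raise.

    Returns True only if the command ran and exited with status 0, so a
    missing binary (e.g. when running outside HTCondor) is harmless.
    """
    exe = shutil.which(chirp_exe)
    if exe is None:
        return False
    try:
        result = subprocess.run(
            chirp_command(attr_name, attr_expr, exe),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

A long-running code could then call something like `report_progress("ChirpPyCBCTemplatesDone", n_done)` from its main loop and ignore the return value.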

This should be done for

  • [ ] pycbc_inspiral
  • [ ] pycbc_optimal_snr
  • [ ] pycbc_compute_psd

duncan-brown avatar Aug 16 '17 11:08 duncan-brown

Particularly, if you are invoking from C, here's the interface we use:

https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/plugins/CondorStatusUpdater.cc#L308

Note that to reduce the stream of updates, we only send every N minutes at most.
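The rate limit can be sketched in Python as a small helper that only lets an update through if enough time has passed since the last one. The class name `Throttle` and its parameters are hypothetical, and the interval is left to the caller since the value of N is not specified here.

```python
import time


class Throttle:
    """Allow an action at most once per min_interval_s seconds."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested
        self._last = None        # time of the last permitted action

    def ready(self):
        """Return True (and record the time) if an update may be sent now."""
        now = self._clock()
        if self._last is None or now - self._last >= self.min_interval_s:
            self._last = now
            return True
        return False
```

In a processing loop this would be used as `if throttle.ready(): send_update(...)`, where `send_update` stands for whatever function issues the chirp call.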

There's a bit of trickiness around getting quoting correct - if you are actually invoking from python, then you might consider using the HTCondor python bindings to properly escape the strings.
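To illustrate the quoting issue: the attribute value is parsed as a ClassAd expression, so a string has to be passed as a quoted string literal, while numbers go through unquoted. The helper below is a simplified hand-rolled sketch (hypothetical name, handles only backslashes and double quotes); if the HTCondor Python bindings are available, their `classad.quote` function is likely the safer choice.

```python
def quote_classad_string(value):
    """Wrap a Python string as a ClassAd string literal.

    Simplified assumption: only backslashes and double quotes need
    escaping. Backslashes are escaped first so the escapes added for
    quotes are not doubled.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'
```

When the result is passed to condor_chirp as one element of a subprocess argument list, no extra shell quoting is needed on top of this.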

CMS's updates basically include the following:

  • A timestamp, so we can determine upstream when progress was last made.
  • I/O layer instrumentation (files processed, bytes read, bytes written, number of operations, total time taken).
  • Physics instrumentation (events / lumis / runs processed).

bbockelm avatar Aug 16 '17 12:08 bbockelm

See also https://jira.isi.edu/browse/PM-1357

duncan-brown avatar Jun 07 '19 17:06 duncan-brown

@duncan-brown Is this now done, or should this remain open?

ahnitz avatar Sep 21 '21 18:09 ahnitz