
Shared pipeline outputs for instrument monitors

Open bhilbert4 opened this issue 2 years ago • 10 comments

Currently each instrument monitor works in its own silo. Data are retrieved from MAST, steps of the pipeline may be run, and analyses are done.

It could be helpful to have a given monitor check the output products from other monitors to see if a file it needs already exists in the JWQL products area. In that case we could avoid re-running pipeline steps, which would save both time and memory. With more and more monitors being added, there will be times when more than one monitor is running at a time on the server. We've seen that with large input files (and most likely multiple monitors running) we can use up all available memory. (#905)

Perhaps a separate module that can search all monitor output directories for the version of the file needed, and then create a symbolic link to it if it exists?

bhilbert4 avatar Mar 24 '22 21:03 bhilbert4

Perhaps a centralized module that knows, for data with a particular instrument/aperture/exp_type/etc., that the dark monitor needs steps x, y, and z of the pipeline run, with outputs saved from steps x and z, while the bad pixel monitor needs steps w, x, y, and z run, with outputs saved from w and z. It would then run those steps, save the outputs at the needed steps, and perhaps create symbolic links to the appropriate files in each monitor's output directory. Roughly the kind of mapping sketched below.
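
A minimal sketch of what such a mapping might look like, using the placeholder step labels from the comment above; the dictionary layout and monitor names are illustrative only, not actual jwql code:

    # Hypothetical mapping from each monitor to the pipeline steps that
    # must be run for it, and the steps whose outputs it needs saved.
    MONITOR_PIPELINE_NEEDS = {
        'dark_monitor': {
            'run_steps': ['x', 'y', 'z'],
            'save_outputs': ['x', 'z'],
        },
        'bad_pixel_monitor': {
            'run_steps': ['w', 'x', 'y', 'z'],
            'save_outputs': ['w', 'z'],
        },
    }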

bhilbert4 avatar Mar 29 '22 20:03 bhilbert4

Running things with a reduction server is distinctly possible; I have experience doing this with celery. The basic organization would be for the monitor to look for new files, send a request to the reduction server (via celery) for each file that needs to be reduced, and provide a callback function which would open the processed data (or processing log?) and write to the database. The reduction server could return immediately if the reduced data already exist, process reductions one by one (or n by n, depending on how many workers we can get resources for), and hopefully reduce contention. I have existing code that does this for STIPS, and could put together a proof-of-concept from an existing monitor if that would be useful. A rough sketch of the server side is below.
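
A minimal sketch of that server side, assuming celery with a redis broker; expected_output_name() and run_pipeline() are hypothetical stand-ins for the real output naming scheme and pipeline invocation:

    # reduction_server.py -- hypothetical celery worker for pipeline runs.
    import os
    from celery import Celery

    app = Celery('reduction',
                 broker='redis://localhost:6379/0',
                 backend='redis://localhost:6379/0')

    @app.task
    def calibrate_file(input_file):
        """Calibrate one file, returning immediately if already done."""
        output_file = expected_output_name(input_file)  # assumed helper
        if os.path.isfile(output_file):
            return output_file  # reduced data already exist
        run_pipeline(input_file, output_file)  # assumed helper
        return output_file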

york-stsci avatar Mar 30 '22 15:03 york-stsci

I think it would be interesting to see. How would the reduction server know which pipeline steps to create outputs for? For example, suppose both the dark monitor and the readnoise monitor will use the same file, but the dark monitor needs the output from the jump step and the ramp-fitting step, while the readnoise monitor needs the output from the reference pixel subtraction step. It seems like we would need a mapping somewhere to describe this, since the dark monitor itself, when asking for a calibrated file, doesn't know anything about the calibrations needed by other monitors.

bhilbert4 avatar Mar 31 '22 14:03 bhilbert4

The way that celery does this is that, effectively, you're calling the celery server and passing it anything that can be serialized. In practical terms, this usually means a dictionary. We're now running into the limit of my understanding of how the JWST pipeline works, since it's not something I'm incredibly familiar with. But assuming that it's re-entrant, in the sense that you can run step A and then later go back and run step B (using the output from step A), and assuming that each step generates distinct output files (or at least can), then the way that I would do it is:

  • supply a dictionary that has:
    • the name of the input file (if you need several files, I suggest submitting them one by one)
    • an entry for each pipeline step, with a boolean True/False for whether that step should happen

I'm assuming that there is a known output file naming scheme for each step, and that the monitor only needs the calibrated file rather than direct access to the calibration logs.
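
For concreteness, a request might look something like the following; the step names are modeled on the stage 1 (calwebb_detector1) steps, and the filename is invented:

    # Hypothetical request payload: one input file plus a boolean per
    # pipeline step indicating whether that step should be run.
    request = {
        'input_file': 'jw01234001001_01101_00001_nrca1_uncal.fits',
        'steps': {
            'dq_init': True,
            'saturation': True,
            'superbias': True,
            'refpix': True,
            'linearity': True,
            'dark_current': True,
            'jump': True,
            'ramp_fit': False,  # e.g. stop after the jump step
        },
    }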

In this case, the worker thread would then

  • Check to see whether the output file already exists. If so, return instantly
  • Run the appropriate calibration steps, checking first whether each step has already been done.
  • Once the appropriate steps have been done, return the file path of the requested output
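
In sketch form, under the same assumptions (expected_output(), step_already_done(), and run_step() are hypothetical helpers standing in for the known naming scheme and the actual pipeline calls):

    import os

    def process_request(request):
        """Sketch of the worker-side logic for one request."""
        output_file = expected_output(request)  # known naming scheme assumed
        if os.path.isfile(output_file):
            return output_file  # already calibrated: return instantly
        for step, should_run in request['steps'].items():
            if should_run and not step_already_done(request['input_file'], step):
                run_step(request['input_file'], step)
        return output_file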

So in essence the monitors don't have to know (or care) whether a file they need has already been calibrated; they just always ask for the calibration and, if it's already been done, they get an instant callback with the path to the file. Also, this way, if the worker thread dies because of (e.g.) a segmentation fault, the monitor will still complete, and will know that the calibration failed. And, because the worker process looks for a finished file before running, if necessary one of us can manually calibrate a file and put the calibrated file in the right place, and neither the monitor nor the worker will care how it got there.

This would, of course, require a few modifications to the monitors. Probably the biggest one is that the monitors would (potentially) start, come up with a list of files that need calibration, send them to celery, and be done before any of the calibrations have happened, so there would need to be a way for the monitor to idle until the calibration(s) finish without taking up too many resources. That said, I think that's a solved problem in general, and we can figure out a way to do that.

Does the above make any sense to anyone but me?

york-stsci avatar Mar 31 '22 15:03 york-stsci

I guess one thing that I hadn't thought of until now is what happens if monitor A needs file X, monitor B also needs file X, and B submits its request while the request from A is running but before it finishes. If we only have the resources to run a single worker thread, that's not a problem, because by the time monitor B's task runs the file will exist. But if we're running in parallel, we have to have some way to check whether a particular calibration is already ongoing. That said, I suspect that this is also a solved problem somewhere, and we can borrow or create our own solution if we're ever in the fortunate position of being able to run multiple pipeline threads simultaneously.

york-stsci avatar Mar 31 '22 15:03 york-stsci

I think that does make sense, and could be really useful. Handling the pipeline outputs might be a little tricky. By default, the pipeline saves the output only from the ramp-fitting step, which is the final step in stage 1 of the pipeline. (I think we can ignore stages 2 and 3 of the pipeline for now.) To save the results from any other step, the user has to request that explicitly.

With my NIRCam hat on, the steps of the stage 1 pipeline are:

  1. Attach bad pixel mask to the data
  2. Flag saturated pixels
  3. Subtract superbias
  4. Subtract reference pixels
  5. Linearize the ramp
  6. Dark subtraction
  7. Flag CR hits
  8. Ramp-fitting

So if the dark monitor needs the output from step 7, while the readnoise monitor needs the output from step 4, then in order to get both of those outputs from a single run of the pipeline, you need to request those outputs in the pipeline call. If you were to ask only for the output of step 7, there would be no output file saved from, e.g., step 3 that could be picked up later and have step 4 run on it. You would have to start over from the beginning.
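
If I have the calling convention right (hedged; this is based on the documented jwst Step/Pipeline interface, with refpix and jump being the step names for steps 4 and 7 above, and an invented filename), requesting both intermediate outputs from a single run would look something like:

    # Run calwebb_detector1 once, saving the reference pixel subtraction
    # (refpix) and CR flagging (jump) outputs in addition to the default
    # ramp-fitting product.
    from jwst.pipeline import Detector1Pipeline

    result = Detector1Pipeline.call(
        'jw01234001001_01101_00001_nrca1_uncal.fits',
        steps={
            'refpix': {'save_results': True},  # readnoise monitor needs this
            'jump': {'save_results': True},    # dark monitor needs this
        },
    )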

bhilbert4 avatar Mar 31 '22 15:03 bhilbert4

I did create some tools that sound similar, in https://github.com/spacetelescope/jwql/blob/develop/jwql/instrument_monitors/pipeline_tools.py

In that case the code checks which pipeline steps have been run on a given file and determines which steps need to be run in order to get the desired output. It uses boolean tables like you mentioned. This was meant to save time by not running and re-running the same initial pipeline steps. In practice I'm not sure how useful it actually is, since the starting files from MAST will usually be uncal (i.e. raw). For dark current files, there is also a *_dark.fits file in MAST, which has been run through the first 5 steps above. In that case I think the tool helps for cases where we also want the output from steps 7 and 8.
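
A check like that can key off the status keywords the pipeline writes into the primary header; a minimal sketch, assuming S_* keywords that are set to 'COMPLETE' for each step that has run:

    # Sketch: determine which stage 1 steps a file has already been
    # through, based on the S_* header status keywords (assumed here).
    from astropy.io import fits

    STEP_KEYWORDS = {'S_DQINIT': 'dq_init', 'S_SATURA': 'saturation',
                     'S_SUPERB': 'superbias', 'S_REFPIX': 'refpix',
                     'S_LINEAR': 'linearity', 'S_DARK': 'dark_current',
                     'S_JUMP': 'jump', 'S_RAMP': 'ramp_fit'}

    def completed_steps(filename):
        header = fits.getheader(filename)
        return [step for keyword, step in STEP_KEYWORDS.items()
                if header.get(keyword, '') == 'COMPLETE']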

bhilbert4 avatar Mar 31 '22 15:03 bhilbert4

Are there any data types where we won't eventually want all of the pipeline stages? I'm wondering because, in order to make things a bit conceptually simpler, I might be able to just always run the entire stage 1 pipeline, and always turn on all of the output file types, and just call it done.

york-stsci avatar Mar 31 '22 16:03 york-stsci

I'm betting that in almost all cases we will never want the output from all the pipeline stages. But we may be able to narrow things down to the outputs from just a few steps.

bhilbert4 avatar Apr 01 '22 14:04 bhilbert4

Okay.

Based on my testing so far, it seems like the lowest-disruption way of doing this is to create a celery pipeline task (probably using the redis backend, just because that's what I'm familiar with) and to include celery-singleton (which will look for identical requests to calibrate the same file and, instead of starting another calibration task, return the AsyncResult that corresponds to the existing task).
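
In code, the deduplication is mostly a matter of the task's base class; a sketch, assuming celery-singleton's documented usage:

    # Identical calls to calibrate_file with the same arguments return
    # the existing task's AsyncResult instead of queueing a duplicate.
    from celery import Celery
    from celery_singleton import Singleton

    app = Celery('reduction',
                 broker='redis://localhost:6379/0',
                 backend='redis://localhost:6379/0')

    @app.task(base=Singleton)
    def calibrate_file(input_file):
        ...  # run the pipeline and return the output file path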

At that point, the lowest-disruption way of working on the monitors is to take the part of the monitor that calibrates data, and replace it with

    result = calibration_task.delay(file_to_calibrate)
    path_to_calibrated_data = result.get()

which will act as a blocking call that simply waits until the calibrated data are available. There are also ways to avoid blocking on a result, and instead include a callback (or equivalent) to let the monitor queue up a bunch of reductions and then idle while waiting for them to finish. That would require more changes to the monitors, but the two methods can co-exist quite easily, so we can make those changes when and as needed.
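
One possible non-blocking pattern, as a sketch only (the real monitors might use callbacks instead; calibration_task and files_to_calibrate are the hypothetical names from above):

    # Queue several calibrations, then idle cheaply until all finish.
    import time

    results = [calibration_task.delay(f) for f in files_to_calibrate]
    while not all(r.ready() for r in results):
        time.sleep(30)
    calibrated_paths = [r.get() for r in results]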

york-stsci avatar Apr 01 '22 15:04 york-stsci