pypiper KeyError: 'Time' when using pipestat via pypiper

When I'm trying to switch from a normal pypiper pipeline to one that configures pipestat, I'm getting this error:

Traceback (most recent call last):
  File "/home/nsheff/code/seqcolapi/analysis/pipeline/add_to_seqcol_server.py", line 92, in <module>
    pm.stop_pipeline()
  File "/home/nsheff/.local/lib/python3.11/site-packages/pypiper/manager.py", line 2106, in stop_pipeline
    self.report_result("Time", elapsed_time_this_run, nolog=True)
  File "/home/nsheff/.local/lib/python3.11/site-packages/pypiper/manager.py", line 1616, in report_result
    reported_result = self.pipestat.report(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 99, in inner
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 571, in report
    schema=self.result_schemas[r],
           ~~~~~~~~~~~~~~~~~~~^^^
KeyError: 'Time'

I can't track this down because I'm not doing anything related to Time, so it must be coming from pypiper or pipestat somehow.

Feb 15 '24 14:02 nsheff

One hint is this message:

These results exist for 'DEFAULT_SAMPLE_NAME': Time
These results exist for 'DEFAULT_SAMPLE_NAME': Success

It looks like there might be a bug somewhere with a constant that is getting stored as a string instead.

Feb 15 '24 15:02 nsheff

I think pipestat_sample_name is not being passed through to pipestat

Feb 15 '24 17:02 nsheff

Actually, I think it's pipestat_results_file that's not working correctly...

Feb 15 '24 17:02 nsheff

I figured it out.

Pypiper automatically adds results for Time and Success. If those aren't in your output schema, it fails. So you have to add this to the output schema:

  Time:
    type: "string"
    description: "Elapsed time for the pipeline run as reported by pypiper"
  Success:
    type: "string"
    description: "Timestamp for when the pipeline completed"

I think this is suboptimal: I'm not the one reporting those results, pypiper reports them automatically. Maybe pypiper should be the one adding them to the output schema, since it's the one reporting them.
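In the meantime, a quick check like the one below can tell you whether a schema is missing the automatic results before the pipeline runs. This is only a sketch using plain dicts: `AUTO_REPORTED` and `missing_auto_results` are hypothetical names, not part of pypiper or pipestat, and it assumes the result definitions have already been loaded into a flat mapping of result name to definition.

```python
# Results that pypiper reports automatically, per the discussion above.
# AUTO_REPORTED is a hypothetical constant, not an actual pypiper name.
AUTO_REPORTED = {
    "Time": {
        "type": "string",
        "description": "Elapsed time for the pipeline run as reported by pypiper",
    },
    "Success": {
        "type": "string",
        "description": "Timestamp for when the pipeline completed",
    },
}


def missing_auto_results(result_schemas):
    """Return the automatically reported result names absent from the mapping."""
    return sorted(name for name in AUTO_REPORTED if name not in result_schemas)


# Hypothetical user schema that defines only its own result.
user_schema = {
    "number_of_things": {"type": "integer", "description": "Hypothetical result"},
}
print(missing_auto_results(user_schema))  # ['Success', 'Time']
```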

Feb 15 '24 17:02 nsheff

I made a more informative error message in pipestat to address this here: https://github.com/pepkit/pipestat/commit/0d511b5960d460b4dda701379f6a982e3f407a0c
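For illustration, the general idea of a friendlier failure can be sketched as follows. This is not the code from the commit: `lookup_result_schema` is a hypothetical helper, and the mapping passed to it is assumed to be a plain dict of result name to definition.

```python
def lookup_result_schema(result_schemas, result_identifier):
    """Look up a result definition, raising a descriptive KeyError if absent."""
    try:
        return result_schemas[result_identifier]
    except KeyError:
        # List what the schema does define so the mismatch is easy to spot.
        known = ", ".join(sorted(result_schemas)) or "(none)"
        raise KeyError(
            f"'{result_identifier}' is not defined in the output schema. "
            f"Defined results: {known}."
        ) from None
```

With a schema that lacks Time, calling `lookup_result_schema(schema, "Time")` then names both the missing result and the results the schema does define, rather than raising a bare KeyError: 'Time'.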

This at least solves the immediate issue, but going forward:

[ ] pypiper should add anything it uses into the schema on its own
[ ] pipestat probably needs to make it easier to merge/update/combine schemas. Right now you can only give it a file path; there's no way to set or update the schema programmatically. So the pipestat schema loading system first needs to be more flexible, so that pypiper can update the schema and add its own parameters.
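The kind of merge the second item asks for could look roughly like this. It is a sketch on plain dicts only: `AUTO_REPORTED` and `with_auto_results` are hypothetical names, and nothing here uses or reflects an existing pipestat API. User definitions take precedence, so a pipeline that already declares Time or Success keeps its own version.

```python
import copy

# Hypothetical defaults for the results pypiper reports on its own.
AUTO_REPORTED = {
    "Time": {
        "type": "string",
        "description": "Elapsed time for the pipeline run as reported by pypiper",
    },
    "Success": {
        "type": "string",
        "description": "Timestamp for when the pipeline completed",
    },
}


def with_auto_results(user_results, auto_results=AUTO_REPORTED):
    """Return a new mapping with the automatic results added.

    Entries in user_results override the automatic defaults, and neither
    input mapping is modified.
    """
    merged = copy.deepcopy(auto_results)
    merged.update(copy.deepcopy(user_results))
    return merged


# Hypothetical user schema with a single result.
user_results = {"number_of_things": {"type": "integer"}}
merged = with_auto_results(user_results)
print(sorted(merged))  # ['Success', 'Time', 'number_of_things']
```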

Feb 15 '24 17:02 nsheff

Also confirmed this by adding the output_schema to the PipelineManager during the test_pipeline_manager.py test (I was initially surprised our tests didn't catch this):

        self.pp = pypiper.PipelineManager(
            "sample_pipeline",
            outfolder=self.OUTFOLDER,
            multi=True,
            pipestat_schema="/home/drc/GITHUB/pypiper/pypiper/tests/Data/sample_output_schema.yaml",
        )

It will indeed fail with a KeyError: tests/pipeline_manager/test_pipeline_manager.py::PipelineManagerTests::test_me - KeyError: 'Time'

Feb 15 '24 23:02 donaldcampbelljr
