Remove URL fragment in lineage IDs

Open bentsherman opened this issue 7 months ago • 11 comments

This PR removes the use of URL fragments in LIDs, since jq can be used on the command line. The #output shortcut is replaced by adding both WorkflowLaunch and WorkflowRun to the history log.

  • Rename WorkflowRun -> WorkflowLaunch
  • Rename WorkflowOutput -> WorkflowRun
  • Add launch LID / run LID to nextflow lineage list
  • Remove TaskOutput since it is not used
  • Remove the use of URL fragments

bentsherman avatar Apr 25 '25 21:04 bentsherman

Deploy Preview for nextflow-docs-staging ready!

Latest commit: 0193cfe82610a6b7f747e8b8653be585701c36d5
Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/681ec3d0df2f7d0008d91d0b
Deploy Preview: https://deploy-preview-6011--nextflow-docs-staging.netlify.app

netlify[bot] avatar Apr 25 '25 21:04 netlify[bot]

This PR removes a lot of unnecessary complexity around the LID filesystem that I feel is getting in the way of core use cases.

There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.

bentsherman avatar Apr 25 '25 21:04 bentsherman

Before we move this forward, I think we should decide whether we actually want to embed the workflow output value in the metadata. Like I mentioned, a pipeline with a samples output could produce 100,000 samples, which would be saved as a giant list in the metadata.

Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.

In that case, maybe we could drop the use of URI fragments entirely. Users already have the index file as a drop-in replacement for samplesheets, and traversing the metadata can already be done more effectively via jq or json-path.

So I think my main goal for the 25.04 release is to remove the URI syntax overloading (fragment, query params) in favor of something more familiar and flexible.

bentsherman avatar Apr 26 '25 22:04 bentsherman

> Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.

AFAIK the index file is not mandatory; users need to indicate that they want to write it, right? If they do not write it, what should we put as the output? Maybe we should write the index file in all cases, and either publish it to a file if requested or store it as lineage metadata.

jorgee avatar Apr 28 '25 08:04 jorgee

> There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.

I think that for view, it should just try to load the record and check the fragment. LinPath also checks whether the path is a subpath of a description, but that is not needed in this case.

jorgee avatar Apr 28 '25 08:04 jorgee

Updated this PR based on the latest changes, focusing on removing the use of URL fragments entirely in the LID filesystem. This makes the user experience simpler (no need to remember the #output hack) and removes a lot of complex code.

Every workflow execution now has a "launch LID" and "run LID". The former is added to the history log when the workflow begins, the latter is added when the workflow completes.
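The two-step bookkeeping described above can be sketched roughly as follows. This is a minimal Python illustration, not the Nextflow implementation (which is not shown in this thread): the `HistoryEntry` and `HistoryLog` classes, their method names, and the placeholder LID strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryEntry:
    # Hypothetical record; fields loosely follow the columns of `nextflow lineage list`.
    run_name: str
    session_id: str
    launch_lid: str
    run_lid: Optional[str] = None  # not known until the workflow completes


class HistoryLog:
    """Hypothetical in-memory stand-in for the history log."""

    def __init__(self) -> None:
        self._entries: List[HistoryEntry] = []

    def on_begin(self, run_name: str, session_id: str, launch_lid: str) -> None:
        # The launch LID is recorded as soon as the workflow begins.
        self._entries.append(HistoryEntry(run_name, session_id, launch_lid))

    def on_complete(self, launch_lid: str, run_lid: str) -> None:
        # The run LID is attached to the most recent entry with a matching launch LID.
        for entry in reversed(self._entries):
            if entry.launch_lid == launch_lid:
                entry.run_lid = run_lid
                return
        raise KeyError(launch_lid)

    def entries(self) -> List[HistoryEntry]:
        return list(self._entries)


log = HistoryLog()
log.on_begin("example_run", "session-1", "lid://launch-placeholder")
print(log.entries()[0].run_lid)  # still unset while the run is in progress
log.on_complete("lid://launch-placeholder", "lid://run-placeholder")
print(log.entries()[0].run_lid)
```

Entries are matched on the launch LID rather than the session ID because a resumed run may reuse its session ID; that choice is an assumption of this sketch.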

Here's a simple test you can run to get started:

rm -rf .lineage/
nextflow run rnaseq-nf -r lineage -profile conda -resume --labels foo,bar
nextflow lineage list

@jorgee if these changes make sense to you, can you help me fix the tests? I think they are the same ones as before

bentsherman avatar May 03 '25 00:05 bentsherman

Expected output:

$ nextflow lineage list
TIMESTAMP               RUN NAME        SESSION ID                              LAUNCH LID                              RUN LID                               
2025-05-02 19:06:15 CDT stoic_shaw      bc79451f-c573-4b7d-8e7c-697be8d9cefc    lid://cd7197c02ab1250eafc2bf7499715e5f  lid://304c57e48ab6b324715ad2c5ba55b25e

$ nextflow lineage view lid://304c57e48ab6b324715ad2c5ba55b25e
{
  "type": "WorkflowRun",
  "createdAt": "2025-05-02T19:06:16.160559118-05:00",
  "workflowLaunch": "lid://cd7197c02ab1250eafc2bf7499715e5f",
  "output": [
    {
      "type": "Path",
      "name": "summary",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/summary/multiqc_report.html"
    },
    {
      "type": "Collection",
      "name": "samples",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json"
    }
  ]
}
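For readers who want to pick a single output out of this record without a URL fragment, here is a rough Python sketch of the lookup that a jq filter would otherwise do on the command line. The `find_output` helper is hypothetical and not part of Nextflow; the record is an abbreviated copy of the output above.

```python
import json
from typing import Optional

# Abbreviated copy of the `nextflow lineage view` output shown above.
RECORD = """
{
  "type": "WorkflowRun",
  "workflowLaunch": "lid://cd7197c02ab1250eafc2bf7499715e5f",
  "output": [
    {"type": "Path", "name": "summary",
     "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/summary/multiqc_report.html"},
    {"type": "Collection", "name": "samples",
     "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json"}
  ]
}
"""


def find_output(record: dict, name: str) -> Optional[dict]:
    """Hypothetical helper: return the output entry with the given name, if any."""
    for entry in record.get("output", []):
        if entry.get("name") == name:
            return entry
    return None


run = json.loads(RECORD)
samples = find_output(run, "samples")
print(samples["value"])

# In this record the output values sit under the launch LID, not the run LID.
print(samples["value"].startswith(run["workflowLaunch"] + "/"))
```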

bentsherman avatar May 03 '25 00:05 bentsherman

I added the completion status to the WorkflowRun. It can be SUCCEEDED, FAILED, or CANCELLED, consistent with platform terminology.

bentsherman avatar May 05 '25 16:05 bentsherman

I think these changes are valuable enough to merge now.

@jorgee do you have any remaining concerns? You mentioned something about the workflow outputs not having LIDs, but I don't remember exactly. Otherwise if these changes make sense to you, please approve and I will merge later

bentsherman avatar May 05 '25 16:05 bentsherman

My main concern was about the two LIDs and the paths of the workflow output files. The paths are based on the LAUNCH LID, but users might expect to find them under the RUN LID.

jorgee avatar May 05 '25 16:05 jorgee

Could it be done the other way? I would expect the final workflow run hash to include the workflow outputs as components, so there would be no way for a workflow output to refer to the final workflow run that is referring to it.
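The circularity can be illustrated with a small Python sketch. It assumes the run LID is a content hash over the launch LID and the output paths; the thread does not show how Nextflow actually computes LIDs, so `content_hash`, the choice of SHA-256, and the input strings are all illustrative placeholders.

```python
import hashlib


def content_hash(*components: str) -> str:
    """Illustrative content hash over an ordered list of components."""
    h = hashlib.sha256()
    for c in components:
        h.update(c.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()


# The launch hash depends only on information known before execution.
launch_lid = content_hash("launch", "rnaseq-nf", "--labels foo,bar")

# Output paths can be rooted at the launch LID because it is known up front.
outputs = [
    f"lid://{launch_lid}/summary/multiqc_report.html",
    f"lid://{launch_lid}/samples.json",
]

# The run hash takes the outputs as components, so it exists only once they are fixed.
run_lid = content_hash(launch_lid, *outputs)

# Rooting the outputs at the run LID would change the outputs,
# and therefore the run hash that was supposed to identify them.
rerooted = [o.replace(launch_lid, run_lid) for o in outputs]
print(content_hash(launch_lid, *rerooted) != run_lid)
```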

bentsherman avatar May 05 '25 16:05 bentsherman