Remove URL fragment in lineage IDs
This PR removes the use of URL fragments in LIDs, since jq can be used on the command line to traverse the metadata instead. The #output shortcut is replaced by adding both WorkflowLaunch and WorkflowRun to the history log.
- Rename WorkflowRun -> WorkflowLaunch
- Rename WorkflowOutput -> WorkflowRun
- Add launch LID / run LID to nextflow lineage list
- Remove TaskOutput since it is not used
- Remove the use of URL fragments
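For readers unfamiliar with the syntax being removed, here is a minimal Python sketch of how a LID splits into its parts. The exact old form `lid://<hash>#output` is an assumption pieced together from the "#output shortcut" mentioned above and the `lid://` IDs shown later in this thread, and `split_lid` is a hypothetical helper, not part of Nextflow.

```python
from urllib.parse import urlsplit


def split_lid(lid: str) -> dict:
    """Split a lineage ID into its hash, path, and fragment (hypothetical helper)."""
    parts = urlsplit(lid)
    if parts.scheme != "lid":
        raise ValueError(f"not a lineage ID: {lid!r}")
    return {"hash": parts.netloc, "path": parts.path, "fragment": parts.fragment}


# Old style (assumed form): a fragment selects a field inside the metadata.
old = split_lid("lid://cd7197c02ab1250eafc2bf7499715e5f#output")

# New style: no fragment; a LID names a metadata record or a file beneath it.
new = split_lid("lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json")

print(old)
print(new)
```

Without fragments, a LID only needs a hash and an optional path, and any selection inside the JSON is left to external tools.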
Deploy Preview for nextflow-docs-staging ready!
| Name | Link |
|---|---|
| Latest commit | 0193cfe82610a6b7f747e8b8653be585701c36d5 |
| Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/681ec3d0df2f7d0008d91d0b |
| Deploy Preview | https://deploy-preview-6011--nextflow-docs-staging.netlify.app |
This PR removes a lot of unnecessary complexity around the LID filesystem that I feel is getting in the way of core use cases.
There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.
Before we move this forward, I think we should decide whether we actually want to embed the workflow output value in the metadata. Like I mentioned, a pipeline with a samples output could produce 100,000 samples, which would be saved as a giant list in the metadata.
Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.
In that case, maybe we could drop the use of URI fragments entirely. Users already have the index file as a drop-in replacement for samplesheets, and traversing the metadata can already be done more effectively via jq or json-path.
So I think my main goal for the 25.04 release is to remove the URI syntax overloading (fragment, query params) in favor of something more familiar and flexible.
> Alternatively, the user can save the output to an index file (i.e. samplesheet) and just reference that index file in a downstream pipeline, like they already do. We could simply reference this index file in the metadata instead of the contents. I intend to explore this as part of the TraceObserverV2 proposal.
AFAIK, the index file is not mandatory; users need to indicate that they want to write it, right? If they do not write it, what should we put as the output? Maybe we should write the index file in all cases, and either publish it to a file if indicated or store it as lineage metadata.
> There are a few tests still failing, mainly because the view command now goes through LinPath instead of LinUtils.query() to resolve the LID, and it is currently trying to return the actual path of FileOutputs instead of the metadata description.
I think that for view, it should just try to load and check the fragment. LinPath also checks whether the path is a subpath of a description, but that is not needed in this case.
Updated this PR based on the latest changes, focusing on removing the use of URL fragments entirely in the LID filesystem. This makes the user experience simpler (no need to remember the #output hack) and removes a lot of complex code.
Every workflow execution now has a "launch LID" and "run LID". The former is added to the history log when the workflow begins, the latter is added when the workflow completes.
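To make the two-step bookkeeping concrete, here is a rough Python model of a history log that records the launch LID at the start and fills in the run LID on completion. This is only an illustrative sketch: `HistoryRecord`, `HistoryLog`, and their methods are hypothetical names, and the actual Nextflow history log is implemented differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HistoryRecord:
    run_name: str
    session_id: str
    launch_lid: str
    run_lid: Optional[str] = None  # unknown until the workflow completes


class HistoryLog:
    """Hypothetical in-memory stand-in for the history log."""

    def __init__(self) -> None:
        self._records: Dict[str, HistoryRecord] = {}

    def on_begin(self, run_name: str, session_id: str, launch_lid: str) -> HistoryRecord:
        # The launch LID is available as soon as the workflow begins.
        record = HistoryRecord(run_name, session_id, launch_lid)
        self._records[run_name] = record
        return record

    def on_complete(self, run_name: str, run_lid: str) -> HistoryRecord:
        # The run LID is only added once the workflow has completed.
        record = self._records[run_name]
        record.run_lid = run_lid
        return record


log = HistoryLog()
record = log.on_begin(
    "stoic_shaw",
    "bc79451f-c573-4b7d-8e7c-697be8d9cefc",
    "lid://cd7197c02ab1250eafc2bf7499715e5f",
)
print(record.run_lid)  # still None while the workflow is running
log.on_complete("stoic_shaw", "lid://304c57e48ab6b324715ad2c5ba55b25e")
print(record.run_lid)
```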
Here's a simple test you can run to get started:
rm -rf .lineage/
nextflow run rnaseq-nf -r lineage -profile conda -resume --labels foo,bar
nextflow lineage list
@jorgee if these changes make sense to you, can you help me fix the tests? I think they are the same ones as before.
Expected output:
$ nextflow lineage list
TIMESTAMP RUN NAME SESSION ID LAUNCH LID RUN LID
2025-05-02 19:06:15 CDT stoic_shaw bc79451f-c573-4b7d-8e7c-697be8d9cefc lid://cd7197c02ab1250eafc2bf7499715e5f lid://304c57e48ab6b324715ad2c5ba55b25e
$ nextflow lineage view lid://304c57e48ab6b324715ad2c5ba55b25e
{
  "type": "WorkflowRun",
  "createdAt": "2025-05-02T19:06:16.160559118-05:00",
  "workflowLaunch": "lid://cd7197c02ab1250eafc2bf7499715e5f",
  "output": [
    {
      "type": "Path",
      "name": "summary",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/summary/multiqc_report.html"
    },
    {
      "type": "Collection",
      "name": "samples",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json"
    }
  ]
}
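With the #output fragment gone, picking a value out of this record is ordinary JSON traversal. As a small sketch, the snippet below does it in Python on the record shown above; it assumes the JSON has already been captured as a string (for example from the view command), and the jq filter in the comment is an untested equivalent.

```python
import json

# The WorkflowRun record from the expected output above, captured as text.
view_output = """
{
  "type": "WorkflowRun",
  "createdAt": "2025-05-02T19:06:16.160559118-05:00",
  "workflowLaunch": "lid://cd7197c02ab1250eafc2bf7499715e5f",
  "output": [
    {
      "type": "Path",
      "name": "summary",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/summary/multiqc_report.html"
    },
    {
      "type": "Collection",
      "name": "samples",
      "value": "lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json"
    }
  ]
}
"""

workflow_run = json.loads(view_output)

# Index the declared outputs by name.
# Roughly equivalent jq filter: .output[] | select(.name == "samples") | .value
outputs = {item["name"]: item["value"] for item in workflow_run["output"]}

print(outputs["samples"])
# prints: lid://cd7197c02ab1250eafc2bf7499715e5f/samples.json
```

Note that the output values point at files under the launch LID, not the run LID, which is the point raised in the discussion below.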
I added the completion status to the WorkflowRun. It can be SUCCEEDED, FAILED, or CANCELLED, consistent with platform terminology.
I think these changes are valuable enough to merge now.
@jorgee do you have any remaining concerns? You mentioned something about the workflow outputs not having LIDs, but I don't remember exactly. Otherwise, if these changes make sense to you, please approve and I will merge later.
My main concern was about the two LIDs and the path of the workflow output files. The paths are based on the launch LID, but users might expect to find them under the run LID.
Could it be done the other way? I would expect the final workflow run hash to include the workflow outputs as components, so there would be no way for a workflow output to refer to the final workflow run that is referring to it.
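The ordering constraint in this argument can be sketched in a few lines of Python. Everything here is an assumption for illustration: `content_hash` and `lid` are hypothetical helpers, SHA-256 truncated to 32 hex characters is a stand-in for whatever hashing Nextflow actually uses, and the launch inputs are made up.

```python
import hashlib


def content_hash(*components: str) -> str:
    """Hash an ordered list of components into a 32-character hex ID (assumed scheme)."""
    digest = hashlib.sha256()
    for component in components:
        data = component.encode("utf-8")
        # Length-prefix each component so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:32]


def lid(hash_value: str, path: str = "") -> str:
    return f"lid://{hash_value}{path}"


# The launch hash depends only on things known before execution (made-up inputs).
launch = content_hash("rnaseq-nf", "--labels foo,bar")

# Output paths can be built from the launch LID because it already exists.
outputs = [
    lid(launch, "/summary/multiqc_report.html"),
    lid(launch, "/samples.json"),
]

# If the run hash includes the outputs as components, it can only be computed
# after the outputs are named, so no output path can contain the run hash.
run = content_hash(lid(launch), *outputs)

print(lid(launch))
print(lid(run))
```

Under this assumed scheme, placing output files under the run LID would require knowing `run` before computing it, which is why the paths hang off the launch LID.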