Use Case: Record a software (version) within a conda environment

Open joergfunger opened this issue 2 months ago • 4 comments

As a developer of a workflow that executes benchmarks on different simulation software (FEM), I want to write a workflow that computes the outputs for different benchmarks and different tools so that I can query the provenance KG to compare the results. Each run of that workflow (potentially with a different configuration specifying which simulation software and version to benchmark) would result in a separate RO-Crate, and the crates would then be analyzed jointly.

The workflow is implemented in a workflow manager (Snakemake and Nextflow). It is composed of multiple steps and loops, nested in this order: over all benchmarks, over all tools the benchmark is implemented in, and over all parameter files (e.g. different mesh densities in a convergence study). There are multiple tasks/processes, and the compute environments are specified using conda.

At a later stage I would like to query the provenance graph for results obtained with a specific simulation software, e.g. to plot the results of benchmark A for software tools X version 1.2.3, Y version 3.4.5, and Z version 4.5.6. For that I need a PID that describes software X at version 1.2.3. Is there a way of defining a PID based on an installed conda package? In other words, how would I reference a piece of software in workflow provenance?

In the current Nextflow provenance plugin, the instrument of the CreateAction for the complete workflow is the Nextflow workflow. At the level of each process, the instrument is not specified yet. I'm not sure what should be added there, since the conda environment is neither a SoftwareApplication with a unique URL (it is composed of multiple packages) nor a single SoftwareSourceCode. In theory I could add all installed packages as additional items (which would be a huge list), but in practice I'm only interested in the specific package for the simulation tool; the rest are usually just dependencies of that tool.
However, for a general plugin that documents provenance, I'm not sure what we should aim to document when the process/task in a workflow step is a Python script executed in a conda environment. Any ideas on that would be really appreciated.
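To make the question concrete, here is a minimal sketch of what a per-process record could look like. It is not what any existing plugin does. It assumes the environment is available as a list of package records in the shape printed by `conda list --json`, that only the tools named by the user are recorded as the `instrument` of the process-level CreateAction, and that a Package-URL-style string (`pkg:conda/...`) is an acceptable `@id`. That identifier scheme is an assumption, not something RO-Crate prescribes, and the package `simtool-x`, its build string, and the action id `#process-run-1` are hypothetical.

```python
import json

# Hypothetical excerpt of `conda list --json` output; the package names,
# versions and build strings are made up for illustration.
CONDA_LIST = [
    {"name": "simtool-x", "version": "1.2.3", "build_string": "py311h0000000_0",
     "channel": "conda-forge", "platform": "linux-64"},
    {"name": "numpy", "version": "1.26.4", "build_string": "py311h0000000_0",
     "channel": "conda-forge", "platform": "linux-64"},
]


def package_id(pkg):
    """Build a Package-URL-style identifier for one conda package (assumed scheme)."""
    return (
        f"pkg:conda/{pkg['name']}@{pkg['version']}"
        f"?build={pkg['build_string']}&channel={pkg['channel']}"
        f"&subdir={pkg['platform']}"
    )


def software_entity(pkg):
    """Describe one installed package as a schema.org SoftwareApplication."""
    return {
        "@id": package_id(pkg),
        "@type": "SoftwareApplication",
        "name": pkg["name"],
        "softwareVersion": pkg["version"],
    }


def process_action(action_id, packages, tools_of_interest):
    """Return (CreateAction, software entities) for one workflow process.

    Only packages whose name is in `tools_of_interest` become instruments;
    the remaining dependencies of the environment are left out.
    """
    selected = [p for p in packages if p["name"] in tools_of_interest]
    entities = [software_entity(p) for p in selected]
    action = {
        "@id": action_id,
        "@type": "CreateAction",
        "instrument": [{"@id": e["@id"]} for e in entities],
    }
    return action, entities


action, entities = process_action("#process-run-1", CONDA_LIST, {"simtool-x"})
print(json.dumps([action, *entities], indent=2))
```

With records of this shape, a later query across crates could select CreateActions by the `name` and `softwareVersion` of their instrument. Whether the dependency packages should also be listed is left open here.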

In addition, I looked into using a SWHID as the PID for that software package, but it seems difficult to relate a SWHID to the version installed in a conda environment.

As an alternative, I thought about using a Docker container as in issue #39, but the challenge of documenting the actual software tool and the version installed in the container remains the same.

joergfunger, Sep 10 '25 15:09