
Proposal for the improved management of provenance and reproducibility

Open csadorf opened this issue 7 years ago • 9 comments

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


About

This is a proposal for the integration of features into signac-flow that would simplify provenance management and increase the chance of reproducibility. Such features would enable users to keep track of the exact path that data took from its source to the current/latest stage.

Note: This feature depends on and is developed in tandem with the snapshot feature (signac/issue #69) for the signac core application.

Provenance vs. Reproducibility

First, we need to clearly define and distinguish what we mean by data provenance and reproducibility. The former refers to the ability to know exactly where data came from and which operations were applied to them, while reproducibility means that we can also execute these operations again.

To preserve the provenance of the data set, we need to track

  1. the original state of the (input) data,
  2. the order of operations that were applied to the data set,
  3. any parameters that were provided as arguments,
  4. the state of the data set before each operation was applied, and
  5. the software and hardware environment that was used for execution [1].

To be able to immediately reproduce the workflow, we also need to

  1. have access to all scripts used for execution, and
  2. have access to the original or an equivalent execution environment (hardware and software), or at least the ability to create such an equivalent environment.

[1] The approach for tracking the environment is described in the next section.

Approach

The general idea is to create a snapshot before the execution of a data space operation and to store operation-related metadata in a log file or a log document entry. This may be done manually, but could also happen automatically, for example when operations are executed through the flow.run() or the Project.run() interface.

Such a log entry would have the following fields:

{
  timestamp: float,    # The UTC time in seconds since the UNIX epoch when the operation was triggered.
  snapshot_id: int,    # The id of the snapshot taken immediately before execution.
  operation: str,      # The name of the operation.
  cmd: str,            # The command used for execution of this operation.
  canonical: bool,     # Indicates whether the command is notated in canonical form [1].
  environment: dict,   # Information about the execution environment [2].
  num_ranks: int,      # The number of MPI ranks used for execution.
  note: str,           # Optional user-defined remarks or comments.
}
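
For illustration, a concrete log entry might look like this (all values are hypothetical):

{
    'timestamp': 1506441600.0,  # 2017-09-26 16:00:00 UTC
    'snapshot_id': 42,
    'operation': 'my_op',
    'cmd': 'python operations.py my_op {job._id}',
    'canonical': True,
    'environment': {'python': '3.6.2', 'platform': 'Linux-4.10.0-x86_64'},
    'num_ranks': 8,
    'note': 'Executed through aprun.',
}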

A log entry is either appended to a field called log as part of the job document and/or written to a collection-based log file.
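
As a sketch of how such a collection-based log file could be read back (assuming it uses signac's Collection format; the filename follows the examples below):

from signac import Collection

# Open the per-job operations log and print the recorded operations in order.
with Collection.open('operations.log') as log:
    for entry in sorted(log, key=lambda e: e['timestamp']):
        print(entry['timestamp'], entry['operation'], entry['cmd'])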

[1] The recorded command is assumed to be executed within the project root directory and should ideally be in canonical notation. A canonical command is a command that is a function of the job and is therefore easily transferable between jobs. For example, the user may have executed the operation my_op in parallel using the flow.run() interface with the command python operations.py my_op; the canonical command, however, would be python operations.py my_op {job._id}. All commands executed through the Project.run() and the flow.run() interfaces would automatically be canonical; other commands might require some extra treatment.

CAVEAT: The actual command used for execution cannot always be accurately preserved. For example, it is possible to record the number of MPI ranks used for execution, but it is not possible to automatically record which mpirun wrapper variant was actually used for execution and with which parallelization layout. We are assuming that this is desired behavior and that the operation's logic is not a function of the parallelization layout.

[2] Capturing the hardware and software (library) environment accurately and comprehensively is only possible under specific circumstances, e.g., when using a containerized environment. Existing tools such as pip freeze are capable of capturing most software versions within the current Python environment; however, local extensions to the Python path or other locally installed packages may not be captured. Non-Python tools are even harder to detect automatically. We therefore propose a good-enough approach, where it is mostly the user's responsibility to ensure that all essential environment parameters are accurately tracked, while the framework assists with that as much as reasonably possible. See below for the proposed API on how to track the environment.
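
A minimal sketch of what such a best-effort capture could look like (the function name capture_default_environment is hypothetical; only the platform, sys, and subprocess modules and pip are assumed to be available):

import platform
import subprocess
import sys

def capture_default_environment():
    """Best-effort capture of always-accessible environment parameters."""
    env = {
        'python_version': sys.version,
        'platform': platform.platform(),
        'hostname': platform.node(),
    }
    try:
        # Record installed package versions; local path extensions may be missed.
        env['packages'] = subprocess.check_output(
            [sys.executable, '-m', 'pip', 'freeze'],
            universal_newlines=True).splitlines()
    except (subprocess.CalledProcessError, OSError):
        env['packages'] = None  # pip unavailable; must be tracked manually.
    return env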

Proposed API

Keeping an operations log

Logging of operations would be enabled in the following ways:

  1. manually via the internal flow._log_operation() function,
  2. manually via the low-level flow.log_operation() functor,
  3. through arguments to flow.run(), or
  4. through arguments to Project.run().

The API would also be exposed on the command line and in the configuration, but for the sake of brevity, we will focus on the Python API here.

Manual logging

This is how we would create a manual log entry:

flow.log_operation(
    # An instance of JobOperation
    operation,
    # Indicate whether the command is in canonical form (see above) (default: False).
    canonical=False,
    # Generate snapshots before execution if True (default: True).
    snapshot=True,
    # The field within the job document that contains the operations log.
    # None means 'do not store' (default: 'log').
    field='log',
    # The filename of a log file relative to the job's workspace path.
    file='operations.log',
    # An additional user-defined comment (default: None).
    note="Executed through aprun.",
    # Restore the job if the execution of the operation fails (default: True) [1].
    restore_on_fail=True,
)  # Returns the created log entry (dict).
  • The file argument can alternatively be a file-like object.
  • The snapshot argument may optionally be a mapping, to be able to forward arguments to the snapshot function, e.g., {'deep': True}.

Manual tracking is useful where operations are not executed through the flow interface. It is up to the user to ensure that the entry is correct and that the operation is actually executed after creating the log entry.

[1] The log_operation() functor may be used as a context manager, where the log entry is only created if the execution of the operation actually succeeded; otherwise the snapshot is restored:

import subprocess

op = JobOperation('myop', job, './src/my_op {}'.format(job))
with log_operation(op):
    # check=True raises on failure, triggering the snapshot restore.
    subprocess.run(op.cmd, shell=True, check=True)

This would be roughly equivalent to:

op = JobOperation('myop', job, './src/my_op {}'.format(job))
log_entry = log_operation(op, field=None)
try:
    subprocess.run(op.cmd, shell=True, check=True)
except:
    # Restore the pre-execution snapshot on any failure, then re-raise.
    project.restore(_id=log_entry['snapshot_id'])
    raise
else:
    flow._log_operation(log_entry)

Semi-automatic logging through the flow interface

Alternatively, we can use the extended flow.run() interface to keep track:

flow.run(
    # An optional sequence of job-operations (default: None).
    operations=None,
    # The field within the job document that contains the operations log.
    # None means 'do not store' (default: 'log').
    log='log',
    # Specify a log file name relative to the job's workspace (default: None).
    logfile='operations.log',
    # Generate snapshots before execution if True (default: True).
    snapshot=True,
    # An optional parser argument to add the run CLI arguments to (default: None).
    parser=None,
    # Additional keyword arguments to be forwarded to the operations function.
    **kwargs
    )
  • When the first argument is None, the command line interface is started (replicating previous behavior).
  • The logfile argument can alternatively be a file-like object.
  • The snapshot argument may optionally be a mapping, to be able to forward arguments to the snapshot function, e.g., {'deep': True}.

Using the flow.run() interface, the command would be logged in canonical form for the execution of a single operation for a single job. That means that even if the operation was actually executed for all jobs like this: python operations.py my_op, it would be logged as python operations.py my_op {job._id}.
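
A minimal sketch of how this canonicalization could work (the helper name as_canonical is hypothetical and not part of the proposed API):

def as_canonical(cmd, job):
    """Rewrite a recorded command as a function of the job."""
    job_id = str(job)  # In signac, str(job) yields the job id.
    if job_id in cmd:
        # The concrete job id appears in the command: replace it with a placeholder.
        return cmd.replace(job_id, '{job._id}')
    # The command was broadcast over all jobs: append an explicit placeholder.
    return cmd + ' {job._id}'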

Keeping track of the hardware and software (library) environment

The capturing of environment parameters would be embedded into the existing flow.environment structure. The DefaultEnvironment would be extended to capture a few parameters that are always accessible, such as the Python version or information provided by the platform module. For everything else, user input is needed. The current definition of the Environment class could be extended in this form:

class MyEnvironment(DefaultEnvironment):
    pass

MyEnvironment.track_version('gcc', cmd='gcc --version')

Some tracking may also be enabled through the configuration:

[flow]
[[environment]]
[[[track]]]
[[[[gcc]]]]
cmd="gcc --version"

Example Usage

Listing all operations that have been applied to a specific job from the beginning to now would be trivial:

from pprint import pprint

pprint(job.document['log'])

Assuming we wanted to replicate the workflow applied to a_job to another_job, and that commands have been recorded canonically, this is a possible approach:

import subprocess

log = a_job.document['log']
for ts in sorted(log):
    subprocess.run(log[ts]['cmd'].format(job=another_job), shell=True, check=True)

A canonical command is a function of the job's state point and document, e.g.: ./src/operations.py my_op {job._id}

The logic implemented above would also be provided as a high-level function, for instance to replay the exact same sequence of operations:

# Execute with subprocess.run():
for job_op in flow.replay(a_job.document['log'], job=another_job):
    subprocess.run(job_op.cmd, shell=True, check=True)

# Execute with flow.run():
flow.run(flow.replay(a_job.document['log'], job=another_job))

csadorf avatar Sep 26 '17 15:09 csadorf

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@PaulDodd @vyasr @bdice @jglaser I would appreciate your input.

csadorf avatar Sep 26 '17 15:09 csadorf

Original comment by Joshua Anderson (Bitbucket: joaander, GitHub: joaander).


I haven't read through these proposals in detail, but I do have one specific comment/concern. The use of hard links for snapshots poses some risk of duplicating data when transferring between file systems. A careless cp, scp, or rsync without the proper arguments will duplicate the linked data on the target filesystem. I do not know if Globus supports preserving link structure.

csadorf avatar Sep 26 '17 15:09 csadorf

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@joaander Thank you very much for bringing up this concern. I will make sure to keep that in mind and appropriately test and address that.

csadorf avatar Sep 26 '17 15:09 csadorf

Original comment by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


This is a good proposal; I think it is well thought through, and I could see myself using this feature since it won't distort my current workflow patterns very much.

I think I would want to know how source version control plays into this. We could suggest that users use MyEnvironment.track_version('git-commit', cmd='git rev-parse HEAD') or a similar command to track the git commit version of their code when it is run. If I altered my source code, I'd like to know which commit was used when the operation ran. That's perhaps the closest we could get to actually logging the function itself, and the hash is meaningful if I wanted to go back to that point in time in my source code.

csadorf avatar Sep 26 '17 20:09 csadorf

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


It might also be worthwhile adding snapshot-related options directly to flow.add_operation if we want to add configurability of logging on an operation level. I think we might also want to set some limits, or at least the ability to warn when too many snapshots have been taken; if someone starts using flow.run with logging on for every operation and regenerates any data, I could see space usage rising very quickly. Obviously that is the user's responsibility to some degree; I would just suggest including some way to track this and warn about it.

I second Bradley's proposal. Regarding how close we could get to logging the function itself, there is a much more heavyweight solution, which would be to do import inspect; inspect.getsource(my_op) and actually store that output directly rather than depending on sufficiently frequent git commits of source. I'm inclined to veto using that option though, at least as the default behavior. Users who know they don't commit frequently can make that change, but it involves saving the same function text many times and is probably not worthwhile in most cases. An intermediate option could be having flow.run do some sort of check on whether we are in a git repo, and if so, whether there are uncommitted changes to the operations before actually executing.

csadorf avatar Sep 28 '17 17:09 csadorf

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


This is currently scheduled for the 0.7 release.

csadorf avatar May 30 '18 13:05 csadorf

@csadorf would this be completely addressed by #189? More specifically, not the hooks framework alone, but also the packaged hooks that you have introduced there? If so, I'd like to link this issue. If not, we should document what else needs to happen.

vyasr avatar Feb 26 '20 17:02 vyasr

Yes, I believe this would be resolved by #189.

csadorf avatar Feb 26 '20 18:02 csadorf

👍 Linked

vyasr avatar Feb 26 '20 18:02 vyasr