
Proposal: Cross-references of jobs

Open csadorf opened this issue 6 years ago • 6 comments

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


About

This issue describes a proposed feature which would standardize the way that jobs may be referenced within and between different projects. One typical use case is the need to store aggregated results or data that is shared among many different jobs within one larger data space.

Rationale

There is currently no standardized way to reference jobs in order to define relationships between jobs within or across projects. This puts the burden on the user to conceptualize and implement such references, which leads to duplication of effort and possible complications when code is interfaced by third parties. Standardizing references will make it easier for users to set up data spaces with the above-mentioned relationships.

Example

Assume that the user has performed multiple computations at different state points and wants to generate aggregated results, such as a phase diagram, based on that data. We propose that such a workflow would be supported with the following API:

from itertools import tee
import signac
# ...

# Main project:
project = signac.get_project()

# Project for aggregated results:
phase_diagrams = signac.get_project('phase-diagrams')

for (p, T), group in project.groupby(('p', 'T')):
    with phase_diagrams.open_job(dict(p=p, T=T)) as pd_job:
        # We store references (links) to the original jobs within
        # the pd_job's document in order to track the data provenance.
        pd_job.doc.origin, group = tee(group)

        # Now, we generate and store the phase diagram.
        generate_and_save_phase_diagram(group, 'phase_diagram.pdf')

The above-mentioned workflow allows us to easily determine the origin data:

for pd_job in phase_diagrams:
    origin_jobs = phase_diagrams.lookup(pd_job.doc.origin)

Definitions

Terms used in this proposal document:

  • link: A uniform resource identifier (URI), which contains all information needed to look up another job.
  • sub-project: A project within a sub-directory of the current project root directory.
  • parent-project: A project within a parent-directory of the current project root directory.
  • neighbor-project: A project within another directory that is on the same level as the current project root directory.

Explicitly supported use-cases

The following use-cases should be supported by the proposed concept and implementation:

  1. Specify the following relationships: one-to-one, one-to-many, many-to-one, many-to-many,

  2. Specify a reference:

    a) within the same project,

    b) from one project to a sub-project,

    c) from one project to a parent-project,

    d) from one project to a neighbor-project.

Concept

We need two pieces of information in order to be able to locate a job within or across projects:

  1. The project that the referenced job belongs to.
  2. The job id of the referenced job.

The project is referenced by a relative or an absolute path to its root directory. A relative path is defined as relative to a specific project, where the default is the current project.
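The four reference cases listed under the supported use-cases can be illustrated with plain path arithmetic. The following is only a sketch and not part of the proposal: the directory names are made up, and `posixpath.relpath` stands in for whatever signac would use internally to compute a relative project path.

```python
import posixpath


def relative_project_path(target_root, current_root):
    """Return target_root as seen from current_root (POSIX-style paths)."""
    return posixpath.relpath(target_root, start=current_root)


# Hypothetical directory layout, relative to the project at /data/main.
current = "/data/main"
cases = {
    "same project": "/data/main",
    "sub-project": "/data/main/phase-diagrams",
    "parent-project": "/data",
    "neighbor-project": "/data/other",
}
for name, root in cases.items():
    print(f"{name}: {relative_project_path(root, current)}")
```

Under these assumptions the same project resolves to `.`, and the parent and neighbor cases both start with `..`.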

A link is a URI defined like this:

signac://relative/path/to/project#abcdef123456...

The URI scheme is called 'signac', the project root directory is defined as the combination of the netloc and path component, and the job id is specified through the fragment component.

A signac URI can be parsed, for example, with the urllib.parse.urlparse function:

import urllib.parse

o = urllib.parse.urlparse('signac://path/to/project#abcdef')
# netloc holds the first path segment ('path'), path holds the rest ('/to/project')
project = signac.get_project(o.netloc + o.path)
job = project.open_job(id=o.fragment)
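A minimal sketch of building and parsing such a link using only the standard library, without signac itself. The helper names `make_link` and `parse_link` are hypothetical and chosen for illustration; they are not part of the proposed API.

```python
from urllib.parse import urlsplit


def make_link(project_path, job_id):
    """Build a signac:// link from a project path and a job id."""
    return f"signac://{project_path}#{job_id}"


def parse_link(link):
    """Split a signac:// link into (project_path, job_id)."""
    parts = urlsplit(link)
    if parts.scheme != "signac":
        raise ValueError(f"not a signac link: {link!r}")
    # The first path segment lands in netloc, the remainder in path.
    return parts.netloc + parts.path, parts.fragment


link = make_link("../phase-diagrams", "abcdef123456")
print(link)
print(parse_link(link))
```

An absolute project path yields three slashes after the scheme (`signac:///abs/path#...`), leaving the netloc empty. Paths that contain `#` or `?` would need percent-encoding, which this sketch omits.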

Proposed API

The high-level API consists of project-based methods and root-namespace functions.

Project-based API

Using the project-based API, all links are generated relative to a specific project.

project.link_to(job)

Generate a link document for job relative to project.

project.lookup(link)

Look up the project referenced in link relative to project's root directory and then return the referenced job. This function will raise a LookupError if the referenced project cannot be found and a KeyError if the referenced job does not exist in the looked-up project.

project.lookup_project(link)

Look up the project referenced in link relative to project's root directory. This function will raise a LookupError if the referenced project cannot be found.
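The error behavior described above can be sketched with an in-memory stand-in for projects on disk. Everything here is hypothetical (the `PROJECTS` table, the directory names, and the job ids); only the choice of exception types follows the proposal.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical stand-in for projects on disk: root directory -> {job id: state point}.
PROJECTS = {
    "/data/main": {"abc123": {"p": 1.0, "T": 0.5}},
    "/data/main/phase-diagrams": {"def456": {"p": 1.0}},
}


def lookup_project(link, root):
    """Return the job table of the project that link refers to, relative to root."""
    parts = urlsplit(link)
    path = posixpath.normpath(posixpath.join(root, parts.netloc + parts.path))
    if path not in PROJECTS:
        raise LookupError(f"no project at {path!r}")
    return PROJECTS[path]


def lookup(link, root):
    """Return the state point of the referenced job."""
    jobs = lookup_project(link, root)
    # Indexing the dict raises KeyError for an unknown job id.
    return jobs[urlsplit(link).fragment]


print(lookup("signac://phase-diagrams#def456", "/data/main"))
```

Because KeyError is a subclass of LookupError in Python, a caller that wants to distinguish a missing job from a missing project has to catch KeyError first.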

Root-namespace API

The root-namespace API works like the project-based API, but by default acts on the current project, that is, the project returned by signac.get_project(). The root-namespace API can also be used if users want to specify an arbitrary relative path or an absolute path.

signac.link_to(job, from=None)

This function will generate a link to job relative to from. If the argument for from is None (the default), the link will be relative to the return value of signac.get_project().root_directory(); otherwise it will be relative to the path specified in from.

Instead of a directory path, one can also pass an instance of Project as the from argument, in which case the link will be relative to the project's root directory.

signac.lookup(link, from=None)

This function will attempt to look up the job referenced in link relative to from. If no argument for from is provided, the link will be resolved relative to the return value of signac.get_project().root_directory().

The argument for from can be a directory or an instance of Project.
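Note that `from` is a reserved word in Python and cannot be used as a parameter name, so a runnable version of these signatures needs a different spelling. The sketch below uses the hypothetical name `start` and shows one way the argument could be normalized to a directory. `FakeProject` and `resolve_start` are illustrative stand-ins, not signac classes or functions, and the current working directory stands in for the current project's root.

```python
import os


class FakeProject:
    """Hypothetical stand-in that exposes only root_directory()."""

    def __init__(self, root):
        self._root = root

    def root_directory(self):
        return self._root


def resolve_start(start=None, default_root=None):
    """Return the directory that links should be relative to."""
    if start is None:
        # The proposal would use signac.get_project().root_directory() here;
        # this sketch falls back to default_root or the working directory.
        return default_root if default_root is not None else os.getcwd()
    if hasattr(start, "root_directory"):
        # A project-like object: use its root directory.
        return start.root_directory()
    # Otherwise treat the argument as a directory path.
    return os.fspath(start)


print(resolve_start(FakeProject("/data/a")))
print(resolve_start("/data/b"))
```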

Automatic-conversion of instances of Job to links

When storing an instance of Job within a job's state point or document, it is automatically converted to a link. For example:

job.doc.other = other_job

This is equivalent to:

job.doc.other = signac.link_to(other_job)

This enables users to specify links with a concise API and predictable behavior. To ensure that links are relative to the project of the job that contains the references, it is recommended to use with:

with job:
    job.doc.other = other_job

By entering the job's workspace prior to the look-up, we can guarantee that we use the same reference:

with job:
    other_job = signac.lookup(job.doc.other)
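One way such an automatic conversion could work is to intercept assignment in the document container. The sketch below is an assumption about the mechanism, not signac's implementation; `FakeJob` and `LinkingDocument` are hypothetical names.

```python
class FakeJob:
    """Hypothetical stand-in for a job: a relative project path plus an id."""

    def __init__(self, project_path, job_id):
        self.project_path = project_path
        self.id = job_id


class LinkingDocument(dict):
    """Dict that stores FakeJob values as signac:// links on item assignment."""

    def __setitem__(self, key, value):
        if isinstance(value, FakeJob):
            value = f"signac://{value.project_path}#{value.id}"
        super().__setitem__(key, value)


doc = LinkingDocument()
doc["other"] = FakeJob("../b", "abcdef")
print(doc["other"])
```

A plain dict subclass like this only converts values assigned through `doc[key] = value`; `dict.update` and the constructor bypass `__setitem__`, so a real implementation would have to cover those paths as well.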

Examples

Single link to another job

To create a reference to another job you simply call:

project = signac.get_project()
link = project.link_to(other_job)
# equivalent to: link = signac.link_to(other_job, from=project)

To look up the referenced job, we use the complementary Project.lookup function:

other_job = project.lookup(link)
# equivalent to: other_job = signac.lookup(link, from=project)

In general, the following relationship always holds:

assert other_job == project.lookup(project.link_to(other_job))

Link across projects

A job and the job it references do not need to belong to the same project. For example:

project_a = signac.get_project('a/')
project_b = signac.get_project('b/')

job_in_a = project_a.open_job({'foo': 0})
job_in_b = project_b.open_job({'bar': 0})

job_in_a.doc.other_job = project_a.link_to(job_in_b)
job_in_b = project_a.lookup(job_in_a.doc.other_job)

Caveats

Migration

Changing the state point of a job, for example by adding an additional key, changes its id and will therefore break any references to it. Special care must be taken when migrating referenced jobs.

Assume that we have a one-to-many relationship, where one parent job is referenced by many child jobs:

for parent in project_a:
    for i in range(3):
        link_to_parent = project_b.link_to(parent)
        project_b.open_job(dict(i=i, parent=link_to_parent)).init()

Then, to properly migrate all parent jobs, we could use the following recipe, where we take advantage of the groupbydoc function:

for parent_link, children in project_b.groupbydoc('parent'):
    parent = project_b.lookup(parent_link)
    parent.sp.setdefault('new_key', False)
    for child in children:
        child.doc.parent = project_b.link_to(parent)
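Why a migration breaks links can be shown without signac. The sketch below assumes, for illustration only, that a job id is the MD5 hash of the state point serialized as sorted JSON; the helper names `job_id` and `relink` and the plain-dict documents are hypothetical.

```python
import hashlib
import json


def job_id(statepoint):
    """Illustrative id: MD5 of the state point as sorted JSON (assumed scheme)."""
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()


def relink(child_docs, old_id, new_id):
    """Rewrite the id fragment of every 'parent' link that points at old_id."""
    for doc in child_docs:
        base, _, fragment = doc["parent"].partition("#")
        if fragment == old_id:
            doc["parent"] = f"{base}#{new_id}"


old_sp = {"a": 0}
new_sp = {**old_sp, "new_key": False}  # the migration adds a key
old_id, new_id = job_id(old_sp), job_id(new_sp)
print(old_id != new_id)  # the id changes together with the state point

# Three child documents that all reference the same parent.
children = [{"parent": f"signac://../a#{old_id}"} for _ in range(3)]
relink(children, old_id, new_id)
print(children[0]["parent"].endswith(new_id))
```

Recording the old id before the migration is what makes the rewrite straightforward; once it is lost, the parent has to be rediscovered from other information.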

Fixing broken references

Assume that a user migrated jobs without updating the references. One could then use the following recipe to repair the broken links:

for parent_link, children in project_b.groupbydoc('parent'):
    broken_parent = signac.lookup(parent_link)
    assert broken_parent not in project_a
    parent_candidates = project_a.find_jobs(broken_parent.sp())
    assert len(parent_candidates) == 1
    for child in children:
        child.doc.parent = project_b.link_to(parent_candidates[0])

csadorf avatar Aug 17 '18 21:08 csadorf

Original comment by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


I don't think we've discussed "many to many" relationships. I'm not sure if or how those fit into this framework.

csadorf avatar Aug 18 '18 00:08 csadorf

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@bdice I would argue that a many-to-many relationship in this context could be realized by storing many references to project A and many references to project B within one location. This could for example be a job that uses data from multiple jobs from multiple different projects.

csadorf avatar Aug 18 '18 03:08 csadorf

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Update link to be a URI, not a document.

csadorf avatar Aug 18 '18 04:08 csadorf

Addressing this issue thoroughly requires quite a bit of additional work. For the moment I'm bumping this feature past the 2.0 milestone. I'm (possibly ambitiously) targeting #189 for version 2.0, but since there's more work to do to enable this proposal fully I don't think we need this for 2.0. I also don't think this feature needs to break any existing APIs, so we could add it to a 2.x release.

vyasr avatar Sep 16 '20 14:09 vyasr

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 31 '21 06:03 stale[bot]

Reopening since there is still interest in this work.

vyasr avatar Feb 21 '22 20:02 vyasr