data-repository-service-schemas

How to handle related objects

Open delagoya opened this issue 5 years ago • 28 comments

Given a user has a DRS URI to a BAM file, they most likely will also need access to the index file. You can encode the pair as a bundle but would need some mechanism to define what is the "primary" file of the bundle. Otherwise the DRS URI cannot be passed cleanly to the underlying workflow engine or tool.

For example, here is the set of DRS URIs under discussion:

drs://foo.org/1234 # the BAM
drs://foo.org/1236 # the index 
drs://foo.org/4562 # the bundle that contains the above

Many workflows are encoded with directives such as:

samtools view drs://foo.org/1234 chr1:10-10000

We would need some way to identify that the bundle contains the files the main file depends on for functionality, and/or to mark which file is the main one that legacy tools expect as their parameter.
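A minimal Python sketch of what "a bundle with a primary file" could look like. The `Bundle` and `BundleMember` types, the `primary_id` field, and the file names are hypothetical illustrations; none of them are part of the DRS spec.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BundleMember:
    id: str
    name: str
    drs_uri: str


@dataclass
class Bundle:
    id: str
    members: List[BundleMember]
    primary_id: str  # hypothetical field, not part of DRS

    def primary(self) -> BundleMember:
        """Return the member a tool should receive on its command line."""
        for member in self.members:
            if member.id == self.primary_id:
                return member
        raise ValueError(f"bundle {self.id} has no member {self.primary_id}")


bundle = Bundle(
    id="4562",
    members=[
        BundleMember("1234", "foo.bam", "drs://foo.org/1234"),
        BundleMember("1236", "foo.bam.bai", "drs://foo.org/1236"),
    ],
    primary_id="1234",
)

# Build the samtools arguments from the bundle's primary member only;
# the index is staged alongside it but never named on the command line.
command = ["samtools", "view", bundle.primary().name, "chr1:10-10000"]
```

With something like this, a workflow engine could accept the bundle URI and still know which single path to hand to the tool.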

delagoya avatar Apr 30 '19 13:04 delagoya

The response to the bundle query returns the names of the bundle objects. Is that sufficient? The bundle would say, more or less:

[
  {"name": "foo.bam", "id": "1234", "drs_uri": "drs://dnsname/1234"},
  {"name": "foo.bam.bami", "id": "1236", "drs_uri": "drs://dnsname/1236"}
]

ddietterich avatar Apr 30 '19 14:04 ddietterich

CWL has a concept of "secondary files" for exactly this purpose (although secondary files are an attribute of File, not Directory.) I agree with the idea of a bundle being able to identify exactly one object as the "primary" file.

tetron avatar Apr 30 '19 15:04 tetron

I do not think we should add such a special-purpose mechanism. How do you define the semantics of the "primary" file?

It also seems to me that there is a recursion problem. The application has to somehow know what kind of bundle it is so that the application can interpret what "primary" means in that case. How does the application know that?

If the application has that knowledge, why can't it also know which object name is the BAM file? If the application does not have that knowledge, then we need to add some typing to bundles. And around we go.

ddietterich avatar Apr 30 '19 16:04 ddietterich

@tetron yeah - that was the gist of the convo today that we effectively need the concept of secondary files.

@ddietterich The problem is that pretty much all downstream systems are going to take in one file (e.g. the bam file, or the reference fasta) and often implicitly (sometimes explicitly) expect other files to be there. So while the bundle could say "these are all the files that travel together" it's difficult to know which is the file to pass to standard tools, without embedding conventional wisdom everywhere.

geoffjentry avatar Apr 30 '19 20:04 geoffjentry

For posterity, my $0.02 in the discussion was that IMO we should explore removing the concept of file altogether and only have bundle (or some other name). What we'd call a file could always be represented as a bundle with a single file. We could then have some metadata beyond what we already do to specify things like primary file (which in a single file case would always be that one).

My reasoning was that while the underlying compute is on files, a lot of the plumbing is on "concepts", e.g. if I want a BAM, I really want to be moving both it and its index around together. If I want the HG38 reference, I really want that whole pile of files moving around together. We'd have an overall simpler API (no more arguments about unification of URIs for bundles & objects).

The one key pushback was that this could be too large a divergence from existing platforms' models, which could hinder adoption.

geoffjentry avatar Apr 30 '19 20:04 geoffjentry

I would be +1 for having only bundles, as it would align better with the Arvados model of collections.

tetron avatar Apr 30 '19 21:04 tetron

@geoffjentry Can you quantify "pretty much all downstream systems..."? That is a pretty broad claim. My concern is that, in fact, it is a narrow set of tools of a particular style that have that problem. I would not want to make a biomedical platform that only catered to those tools.

ddietterich avatar Apr 30 '19 21:04 ddietterich

@ddietterich I'll rephrase. In the genomics space - for better or for worse - it is an extremely common pattern, and ultimately this is a genomics oriented working group.

Where this topic arose was in a discussion about interoperability between multiple APIs, e.g. passing DRS URIs to WES or TES. There's a need to be able to understand "all of these files always travel together in a cohesive bundle" and "this is the one that's actually important when generating command lines". There are certainly other ways this can be done, but at least among the assembled group this seemed to be the one which required the least amount of magic knowledge in the fewest places.

@sarpera was the one who had the first motivating use case, which reminds me that we should have captured it. Sarper, can you comment?

geoffjentry avatar Apr 30 '19 22:04 geoffjentry

The name of the bundle is what you pass along.

Expanding on @ddietterich's snippet, see @delagoya's bundle example below:

{
 "id": "4562",
 "name": "foo.bam",
 ...
 "contents": [
  {"name": "foo.bam", "id": "1234", "drs_uri": "drs://dnsname/1234"},
  {"name": "foo.bam.bami", "id": "1236", "drs_uri": "drs://dnsname/1236"}
 ],
 ...
}

A DRS client would stage the bundle as a whole (with both objects) and pass the locally materialised name of the bundle, foo.bam, to the downstream system. The DRS client could be the {de}staging system or a client embedded in samtools, if they choose to integrate with DRS.
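One reading of this naming convention, sketched in Python: the primary object is whichever entry in `contents` carries the same name as the bundle itself. The function name `primary_by_bundle_name` is hypothetical, and the convention itself is an assumption drawn from the example above rather than anything the DRS spec states.

```python
from typing import Any, Dict


def primary_by_bundle_name(bundle: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the content entry whose name equals the bundle's own name."""
    matches = [c for c in bundle["contents"] if c["name"] == bundle["name"]]
    if len(matches) != 1:
        # The convention only helps when it singles out exactly one object.
        raise ValueError("naming convention does not identify exactly one object")
    return matches[0]


bundle = {
    "id": "4562",
    "name": "foo.bam",
    "contents": [
        {"name": "foo.bam", "id": "1234", "drs_uri": "drs://dnsname/1234"},
        {"name": "foo.bam.bami", "id": "1236", "drs_uri": "drs://dnsname/1236"},
    ],
}

primary = primary_by_bundle_name(bundle)
```

The explicit failure when zero or several entries match shows where the "magic" lives: the convention works only if whoever created the bundle followed it.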

susheel avatar May 01 '19 11:05 susheel

@susheel From my vantage point at least I think name would work. I'm not in love with the "magic" aspect of it, but it also wouldn't require any further changes to the spec itself.

geoffjentry avatar May 01 '19 12:05 geoffjentry

@geoffjentry I thought the GH in GA4GH stood for "genomics and health." I will continue to advocate that we embrace a broad definition of health data (like EHR) and a more inclusive set of biomedical use cases (like clinical).

ddietterich avatar May 01 '19 17:05 ddietterich

Point taken about genomics++ but I’m going to have to pull out a yellow card.

I think we know the issues and possible avenues for resolution. @sarpera will need to weigh in on the issue since it originally came from him if that is not the case.

Next step to issue resolution is a full example of a CWL/WDL + WES definition that uses DRS inputs with the solution that you are advocating for.

Keep in mind that while @susheel is showing the DRS response that would clarify some information for reasonable runtime conventions, at the point where you have a WDL + inputs.json and are submitting to a WES endpoint, you actually may not know that name mapping information.

delagoya avatar May 01 '19 20:05 delagoya

One thing to keep in mind is that we should be sure to allow for the case where the WES server is decoupled from the underlying language. In other words I don't think we can assume knowledge of WDL, CWL, and NF at the WES layer as it might just be unpacking the bundle and filling in the blanks.

geoffjentry avatar May 01 '19 21:05 geoffjentry

@geoffjentry

"What we'd call a file could always be represented as a bundle with a single file."

I'm against going for this direction because not all actions a client can perform on a bundle and an object are necessarily the same. It would lead to ambiguous requests and responses in a RESTful API.

My intention was to draw attention to the "secondary files" or a similar case where files are logically coupled together, e.g. .cram + .crai, .bam + .bai, etc. Please note that this is already an issue outside of Cloud APIs. Secondary files are currently obtained by a regex match on the same file system directory in some systems, assuming they are already there.

Problem definition is: "if I have a DRS URL for a .cram file, how do I obtain the .crai file for it?"

Two possible solutions emerged in our discussions in the connect meeting (thanks @delagoya for creating the issue!):

1- Use a bundle to group coupled objects together (objects are decoupled and don't know about the relationship)
2- Objects know about the other related objects (objects are coupled at an object level via an object property)

Neither of them is perfect.

Option 1: DRS server needs to create a bundle for the coupled objects by default: 3 POST requests to have .crai, .cram and the bundle that contains them

Search API -> DRS URL -> WES endpoint flow (obtaining DRS URLs from a Search API response): The response of a search that maps to a single CRAM file would need to assume the DRS URL should be a bundle that contains both the CRAM and CRAI objects. That MAY not be the client's intention. Also, just looking at a DRS URL for a CRAM object alone, how would the client obtain the CRAI object from a DRS server?

Validating workflow params in WES: Following the above scenario, the WES params for an input would be a bundle whereas the descriptor language would assume a single file (e.g. a CWL workflow that has type: File as an input). Providing a bundle would invalidate the expected workflow parameter type.

Option 2: Objects in DRS server may have pointers/links to other objects: One-way pointers from one object to another using its DRS URL to establish a connection. Requires a new property on an object response. 2 POST requests to add a .crai that is linked from a .cram object.

Search API -> DRS URL -> WES endpoint flow (obtaining DRS URLs from a Search API response): No issues here.

Validating workflow params in WES: No issues for validating params. But whatever stages the computation environment would somehow need to know that if an object has "related/coupled" objects, they also need to be fetched. This may complicate things on the WES side.
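A minimal sketch of the staging step Option 2 implies, assuming a hypothetical `related` property holding one-way pointers to other DRS URLs. The `OBJECTS` dictionary is an in-memory stand-in for DRS lookups, and the URIs and names are made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for DRS object lookups. The "related" property is
# an assumption; the DRS objects discussed here have no such field.
OBJECTS: Dict[str, Dict] = {
    "drs://foo.org/1": {"name": "foo.cram", "related": ["drs://foo.org/2"]},
    "drs://foo.org/2": {"name": "foo.cram.crai", "related": []},
}


def uris_to_stage(start: str) -> List[str]:
    """Follow one-way "related" pointers and list every URI to fetch."""
    seen: Set[str] = set()
    order: List[str] = []
    stack = [start]
    while stack:
        uri = stack.pop()
        if uri in seen:
            continue  # guard against cycles or repeated links
        seen.add(uri)
        order.append(uri)
        # reversed() keeps the declared order when popping from the stack
        stack.extend(reversed(OBJECTS[uri]["related"]))
    return order
```

Starting from the .cram URI yields both objects, while starting from the .crai URI yields only the index, since the pointers run in one direction. The workflow still receives a single File parameter; the extra fetching happens at staging time.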

sarpera avatar May 03 '19 13:05 sarpera

@sarpera When you say POST requests, to what are you referring? I didn't think there were any POST endpoints in DRS - are you talking about the system needing to register those files/bundles? I don't think we should be assuming how a data repository is populating these values for DRS clients.

Something you alluded to but I don't think was explicit (sorry if I missed it) - because CWL (and likely WDL some day) allows a user to define secondary files via regex, if a system was passed a DRS url for foo.cram the only way it'd be able to work out that it also needed foo.cram.crai (or even foo.crai, depending on underlying tool) would be to interpret the CWL itself. If one had a WES system which was dispatching to multiple engines, this could prove difficult.
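For reference, the string form of CWL's `secondaryFiles` is, per the CWL specification, a suffix pattern rather than a full regular expression: each leading `^` strips one extension from the primary file name, and the rest is appended. A minimal Python sketch of that rule follows (the function name `expand_secondary` is hypothetical, and expression-valued patterns are not handled):

```python
import os.path


def expand_secondary(primary: str, pattern: str) -> str:
    """Apply a CWL-style secondaryFiles suffix pattern to a primary file name."""
    path = primary
    while pattern.startswith("^"):
        pattern = pattern[1:]
        # Drop the last extension; a name without one is left unchanged.
        path, _ext = os.path.splitext(path)
    return path + pattern
```

So `expand_secondary("foo.cram", ".crai")` gives foo.cram.crai and `expand_secondary("foo.cram", "^.crai")` gives foo.crai. Either way the pattern lives in the workflow document, which is why a layer that only sees a DRS URL cannot derive the index name without reading the CWL.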

geoffjentry avatar May 03 '19 15:05 geoffjentry

@geoffjentry, the latter. In my example, the system needs to mint 2 IDs first for 2 objects, and then mint a 3rd ID for the bundle to point to those 2 IDs, which should already exist in the repository at that time. Of course, it depends on whether or not the IDs are minted per request.

Yes, CWL relies on regexes for secondary files. But assuming that all the resources (workflow, files etc) provided in a WES POST request will be "staged" in a computation environment (at least in cloud-powered platforms) prior to executing CWL, it wouldn't matter as long as all required objects are fetched one way or the other.

sarpera avatar May 03 '19 15:05 sarpera

@sarpera

re POSTing/minting: IMO that's a red herring here. At least at the moment IMO we should be focusing on client viewpoint and if I'm a client I don't particularly care what's necessary on the other side.

On your latter point, I thought the issue you raised the other day was that we either need to solve for a) "How does a WES endpoint know which further files to request?" or b) "If already provided all the files, which is the right one to stick in as the primary file for a CWL-style File-with-secondaries input?"

I think this is also what you're saying a couple of posts above. If what I said doesn't sound right to you, please correct me.

geoffjentry avatar May 03 '19 17:05 geoffjentry

@geoffjentry right, I wanted to point out that the complexity of both options is almost the same for the server. From the client viewpoint, "How does a WES endpoint know which further files to request?" emerges from Option 2, and "If already provided all the files, which is the right one to stick in as the primary file for a CWL-style File-with-secondaries input?" from Option 1.

I'm slightly in favour of Option 2 since it does not require having bundles on a system and has less impact on the "search" side of things. But yes, it implicitly requires WES to fetch associated files somehow. My objection to Option 1 is that I think this case with secondary files is different from what bundles are for, because of the primary-secondary relationship.

sarpera avatar May 06 '19 11:05 sarpera

Technically, CWL doesn't require searching for secondary files except as a convenience for the submitter. The submitter (client) can provide secondary files explicitly.

tetron avatar May 06 '19 14:05 tetron

However, I think DRS would greatly benefit from a general pattern for describing dependencies of a file (because if it doesn't have it, that need doesn't go away; instead client applications and data models like CWL have to compensate).

The CWL model is that any file can have secondary files, and those secondary files can have their own secondary files (and so forth, because dependencies are not always 1 level deep).

A model where we have a bundle with a specified primary file seems like it would also work (but only specify 1 level of dependencies?)
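To illustrate the nesting, here is a minimal sketch that walks a CWL-style File object and lists every file it depends on. The `location` and `secondaryFiles` fields follow the shape of CWL File objects; the file names, including the second-level `ref.fa.bwt.extra`, are made up for illustration.

```python
from typing import Any, Dict, Iterator


def walk_files(file_obj: Dict[str, Any]) -> Iterator[str]:
    """Yield a file's location, then its secondary files, depth-first."""
    yield file_obj["location"]
    for secondary in file_obj.get("secondaryFiles", []):
        yield from walk_files(secondary)


reference = {
    "class": "File",
    "location": "ref.fa",
    "secondaryFiles": [
        {"class": "File", "location": "ref.fa.fai"},
        {
            "class": "File",
            "location": "ref.fa.bwt",
            # A secondary file with its own dependency (second level).
            "secondaryFiles": [{"class": "File", "location": "ref.fa.bwt.extra"}],
        },
    ],
}

all_locations = list(walk_files(reference))
```

A flat bundle with one designated primary could express the first level of this tree; the second level would need either nested bundles or some other convention.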

tetron avatar May 06 '19 14:05 tetron

@tetron You're right - if I'm submitting my own workflow I can make sure all secondary files are declared explicitly. But it is likely that I'm submitting something off the shelf, referenced by a TRS ID. And it also feels likely that the author of that workflow used regexes.

geoffjentry avatar May 06 '19 14:05 geoffjentry

@tetron I meant the Search API (linking patient data to raw/sequencing data via DRS URI). I agree that a way to describe dependencies is needed, I'm just not sure if the bundles are the way to solve this.

sarpera avatar May 06 '19 15:05 sarpera

Also note that we are now implying a semantic overload on the bundle concept that is not clearly stated anywhere.

For example:

  • bundleA --> 20 files belonging to a dataset
  • bundleB --> 2 files, 1 BAM and 1 BAI

This means that there is a lot of work at the client level to understand what is what, in particular when the DRS client does not know what "type of bundle" it has.

I agree with @tetron and @sarpera. Maybe the way forward could be to have a "dedicated" concept for dependent files, which basically tells the DRS client: if you get fileA, most likely you want all the dependent files of fileA, which could be indexes or other types that are used in bioinformatics workflows.

At the same time we keep the Bundle idea simple, with the idea of using them as a way to indicate several files that are together.

Also, this helps on the linking part, where the link is to the "real file", while all the accessory/dependent files will come along if they exist; otherwise the user will have the option to re-create them by running the appropriate tool (for example samtools index for a BAM).

mattions avatar May 06 '19 17:05 mattions

While I agree that we (Genomics Community) need the concept of related objects, IMO it is served by Bundles. Adding a dedicated concept of secondary_files directly into Objects muddies the waters between Bundles and Objects.

Q: Would secondary_files also support the listing of secondary bundles?

As Ewan noted at Hinxton, "keep it simple, release, and iterate to improve" :) My vote is to keep the DRS standard simple (at least for v1) and explore the options via x-extensions and/or tooling support.

susheel avatar May 08 '19 14:05 susheel

What's the latest thinking here? I'm also a fan of keeping things simple (for now), let Bundles do the job of connecting primary and secondary files, and expecting the client to do a bit of 'reasoning' to figure out what objects get mapped to workflow inputs.

One model by which we could 'provide more information' about how files are linked could be something like the ResearchObject Manifest file — but I'm fairly naive when it comes to ResearchObjects in general. Anyway, I think some sort of extension or addition like this could be implemented as a non-breaking 'improvement' like @susheel suggested.

jaeddy avatar May 21 '19 15:05 jaeddy

Chiming in to note that we just ran into a real world example of this, albeit with DOS instead of DRS, tho as of this moment there'd be no real difference (DRS might actually make this trickier).

User has a DOS URI which resolves to an array of multiple files, e.g. the canonical BAM & Index. They have a WDL which specifies two inputs: File FooBam and File FooBai (yes, WDL doesn't have a great secondary file system). Something somewhere needs to be able to divine which of those files goes into FooBam and which goes into FooBai.
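A minimal sketch of the "divining" step, assuming the resolving layer holds a table of which suffix each WDL input expects. The `SUFFIX_FOR_INPUT` table and the `assign_inputs` function are hypothetical; neither DOS/DRS nor WDL provides this mapping.

```python
from typing import Dict, List

# Hypothetical mapping from WDL input names to the file suffix each expects.
# Some layer has to hold this kind of conventional knowledge.
SUFFIX_FOR_INPUT = {"FooBam": ".bam", "FooBai": ".bai"}


def assign_inputs(file_names: List[str]) -> Dict[str, str]:
    """Assign each resolved file to the WDL input whose suffix it matches."""
    assigned: Dict[str, str] = {}
    for input_name, suffix in SUFFIX_FOR_INPUT.items():
        matches = [n for n in file_names if n.endswith(suffix)]
        if len(matches) != 1:
            # Zero or several candidates: the suffix rule cannot decide.
            raise ValueError(f"cannot resolve input {input_name}: {matches}")
        assigned[input_name] = matches[0]
    return assigned
```

For example, `assign_inputs(["foo.bam.bai", "foo.bam"])` maps FooBam to foo.bam and FooBai to foo.bam.bai. The sketch works only because someone wrote the suffix table, which is the open question here: which API layer should own it.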

For now we're thinking of some workarounds, but if we want APIs like DRS and WES to interoperate we'll need to consider who'll be on the hook for resolving these issues - and I don't think the answer should be "the end user", so it'll be one of the API layers.

geoffjentry avatar May 21 '19 22:05 geoffjentry

This seems 1) a pretty dated issue and 2) related to #337, #286, and #323. Should this issue be revived? If so, we need a champion to be assigned the ticket. In the meantime I'm applying the Stale label.

briandoconnor avatar Jan 25 '21 20:01 briandoconnor

This has had two months as stale. No champion emerged. Suggest closing this.

If anyone returns to this issue, I suggest following up via the other issues mentioned: #337, #286, and #323.

ianfore avatar Mar 22 '21 20:03 ianfore