ga4gh-server icon indicating copy to clipboard operation
ga4gh-server copied to clipboard

SearchReads accepts only one ReadGroup ID, contrary to schema

Open hjellinek opened this issue 8 years ago • 36 comments

The schemas say that SearchReadsRequest accepts at least one ReadGroup ID, but the ref server accepts at most one.

record SearchReadsRequest {
   /**
   The ReadGroups to search. At least one readGroupId must be specified.
   */
   array<string> readGroupIds;
   ...
}

This test case is from a forthcoming compliance test, ReadsSearchIT.searchReadsWithAllIdsReturnsReadsForEach:

threeReadGroupIds = getThreeReadGroupIdsSomehow(); // details omitted (obviously :-))
SearchReadsRequest request =
                SearchReadsRequest.newBuilder()
                                  .setReadGroupIds(threeReadGroupIds)
                                  .setStart(0L)
                                  .setEnd(150L)
                                  .setReferenceId(refId)
                                  .build();
SearchReadsResponse response = client.reads.searchReads(request);

The call to searchReads throws a GAException:

{"errorCode":1151425451,"message":"Exactly one read group id must be specified"}

with HTTP status 501 (NOT_IMPLEMENTED).

This is contrary to the comments in the schema, not to mention that the declaration of the readGroupIds field is array<string>, not a single string.

hjellinek avatar Aug 28 '15 20:08 hjellinek

This is an intentional limitation of the reference server, as this feature is too complicated for us to implement at the moment. We will try to implement it at some point, but it is difficult to do with our stateless paging architecture.

The reference server's compliance score should be appropriately reduced for this.

jeromekelleher avatar Sep 01 '15 12:09 jeromekelleher

We've just proposed a change to the schemas that would make this feature "implementable". @diekhans

https://github.com/ga4gh/schemas/pull/570

david4096 avatar Mar 09 '16 00:03 david4096

There was strong objection to removing the cross-BAM merge-sort of reads from the GA4GH API. There are very good use cases for this in multi-sample variant calling and it is supported by htslib.

There should be no requirement to scale beyond small number of BAMs for the reference server.

It's very important the primary single BAM be supported very soon. This is obtaining all readgroups within a BAM as part of a single requests

diekhans avatar Mar 09 '16 23:03 diekhans

@jeromekelleher can you elaborate on why the limitation was put in place and what the roadmap to remove it would look like?

dcolligan avatar Mar 15 '16 13:03 dcolligan

From a @diekhans email:

Right now, we don't support the single most important use cases, which is read all the readgroups in a single BAM. The reference server can make the requirement that BAMs are sorted by location across all readgroups; this is almost always the case. It needs to implement the full API, even if this requires excluding some data.

[We need to] implement multiple readgroups within a single readgroupset. That should be easy. The API is not useful as-is. Cross-BAM reads is a bit more complex, as one needs to implement a merge.

dcolligan avatar Mar 16 '16 15:03 dcolligan

It seems like the searchReads endpoint is poorly suited to handle the "give me all reads in a BAM" functionality since it takes a readGroupId (or hopefully, in the future, an array of readGroupIds) argument. Since a ReadGroupSet maps to a BAM, wouldn't an endpoint that took a readGroupSetId be more appropriate? (And as a comment asks in the schema PR, what is a null readGroupSetId array supposed to mean?)

dcolligan avatar Mar 16 '16 21:03 dcolligan

The schema change I proposed above is similar to what you are suggesting @dcolligan, but it was a feature roll-back that wasn't going to be accepted.

Currently the endpoint takes an array of readgroupIds, it's just that the server currently under-implements what is in the schema. https://github.com/ga4gh/server/blob/31c41f4ea7b072dba43be79cd4db6111f41eb131/ga4gh/backend.py#L532

The ReadsIntervalIterator could be improved to allow multiple read groups to be specified, instead of being fixed to a parent container. One would then merge the results before returning them as part of _search.

david4096 avatar Mar 16 '16 23:03 david4096

I agree that the searchReads endpoint, implemented correctly to take N readGroupIds, could implement the "give me all reads in a BAM" functionality. However, that would require the client to know all of the readGroupIds contained in the BAM. That seems like an inconvenient way to address a BAM, which should ideally be done via a readGroupSetId. It also mandates an extra step for a client: getting a complete list of all of the readGroupIds that are in a readGroupSet/BAM.

From that schema thread it seems like we shouldn't abandon the current readsSearch implementation, just make it conform to the schema (i.e. accept multiple readGroupIds). Which, of course, is what this issue is all about.

However, that still doesn't cleanly address, in my view, the motivating "give me all reads in a BAM" use case. If we feel that use case is important -- and it seems that @diekhans does -- shouldn't we advocate for a new endpoint in the API that implements this functionality?

dcolligan avatar Mar 17 '16 13:03 dcolligan

Yes, it's inconvenient for the clients. However, lets revisit it independent. We need to move the server forwards.

Danny Colligan [email protected] writes:

I agree that the searchReads endpoint, implemented correctly to take N readGroupIds, could implement the "give me all reads in a BAM" functionality. However, that would require the client to know all of the readGroupIds contained in the BAM. That seems like an inconvenient way to address a BAM, which should ideally be done via a readGroupSetId. It also mandates an extra step for a client: getting a complete list of all of the readGroupIds that are in a readGroupSet/BAM.

From that schema thread it seems like we shouldn't abandon the current readsSearch implementation, just make it conform to the schema (i.e. accept multiple readGroupIds). Which, of course, is what this issue is all about.

However, that still doesn't cleanly address, in my view, the motivating "give me all reads in a BAM" use case. If we feel that use case is important -- and it seems that @diekhans does -- shouldn't we advocate for a new endpoint in the API that implements this functionality?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 17 '16 14:03 diekhans

If we allow imposition of additional burdens on the client to solve this issue, our client already does "solve" the problem by making a request to searchReads one readGroupId at a time.

dcolligan avatar Mar 18 '16 13:03 dcolligan

We are not allowed to pass this off to client. That is very inefficient. These choices were made for very good reasons. Those reasons are not documented. BAMs are sorted by readgroupset, not readgroup.

diekhans avatar Mar 18 '16 14:03 diekhans

Question: if len(readGroupIds) > 1, in what order should the reads be returned? Should they be returned sorted by coordinate, or by readGroupId (and then by coordinate within readGroupId)? The schema doesn't specify this.

dcolligan avatar Mar 18 '16 14:03 dcolligan

And if the later, do we need to return the readGroupId reads in the order in which they were specified in the readGroupIds array, or can we do it in any order?

dcolligan avatar Mar 18 '16 14:03 dcolligan

I think you want to sort by position.

  The list of matching alignment records, sorted by position.
  Unmapped reads, which have no position, are returned last.

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/readmethods.avdl#L61

david4096 avatar Mar 18 '16 20:03 david4096

They should be returned sorted by coordinates across all read group. the schema should specify this; I think it's there, but obscure.

However, the server should not sort it. The reference server requirement for serving a BAM can be that all readgroups in a readgroupset are are sorted in a single BAM in sorted order.

This will almost always be the case. it becomes the data providers problem to create a conforming BAM, not the reference servers problem to sort it.

Danny Colligan [email protected] writes:

Question: if len(readGroupIds) > 1, in what order should the reads be returned? Should they be returned sorted by coordinate, or by readGroupId (and then by coordinate within readGroupId)? The schema doesn't specify this.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 18 '16 20:03 diekhans

I think this is going to be harder than you think @diekhans, but I'll look into it and give an analysis over the next day.

jeromekelleher avatar Mar 21 '16 16:03 jeromekelleher

@jeromekelleher It's always harder than I think and easier than you think! Either way, I owe you a beer.

@dcolligan is also looking at it.

Jerome Kelleher [email protected] writes:

I think this is going to be harder than you think @diekhans, but I'll look into it and give an analysis over the next day.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 21 '16 17:03 diekhans

@jeromekelleher and I have been talking about this, but here are some of the things that have come up (I, of course, still want to hear his opinion in this thread).

Firstly, no other API endpoint we have has an array of ids as a mandatory argument (we have some that have arrays as optional arguments). This introduces some problems, mostly of specification. For instance: what happens if some of the ids are invalid, but others are valid? What if there are duplicate ids? Is there a limit on how many ids can be sent, or should we just keep accepting ids until the server runs out of memory? Etc.

Secondly, although the intended use case may be searching for reads within the same BAM, there is nothing stopping the client from sending readGroupIds that are contained in different BAMs. In that case, how should we order the reads that come back? If we do need to return them by position, that would imply sorting (that is, merging) them across different BAMs.

Thirdly, given our stateless architecture, we need to keep track of the readGroupIds that were originally sent since we have no way of storing them on the server. This may be a large list, which means they will need to be transmitted on every request and response in the pageToken (as well as any indicies which are bookkeeping for whatever algorithm we implement). That is a lot of overhead.

All of this seems to point in the direction of instead implementing a readGroupSetSearch endpoint that takes one readGroupSetId as a mandatory argument. In this way, we avoid all of the problems above since we would be at most searching through one BAM.

dcolligan avatar Mar 22 '16 15:03 dcolligan

pysam shows the code to implement this is: iter = samfile.fetch("seq1", 10, 20) for x in iter: print (str(x)) once one figures out if the list of readgroups are all the groups in the readgroupset, where do the problems lie?

diekhans avatar Mar 22 '16 15:03 diekhans

That only searches one BAM, not N BAMs. That would be similar to the code that we would have in a readGroupSetSearch endpoint.

dcolligan avatar Mar 22 '16 15:03 dcolligan

Well, it looks like nobody actually implements multiple BAMs (Google don't apparently) so we may as well follow the crowd here. As an initial pass, how about we implement this as if it was passed a single ReadGroupSetId. That is, there are two legal inputs for the readGroupIds in the searchReads endpoint:

  1. Exactly one readGroupId, which we have implemented now; or
  2. k readgroup ids which correspond to exactly one ReadGroupSet. That is, all the readGroupIds belong to one ReadGroupSet, and all ReadGroups from that set are present. Therefore, we don't have to worry about filtering, and we don't have to worry about merging among an arbitrary number of BAMs. We might relax either of these assumptions in the future, or we might not.

I think the best thing would be to have two independent code paths to deal with this, so that we end up with something like:

if searchByReadGroup:
    intervalIterator = ReadsIntervalIterator(request, readGroup, reference)
else:
    intervalIterator = ReadsIntervalIterator(request, readGroupSet, reference)

In the future, hopefully sanity will prevail and we will split these two things (search through ReadGroup and search through ReadGroupSet) into different endpoints. For now, hopefully this workaround will cover the majority of the use cases that people actually have.

@dcolligan, what do you think, is this implementable? @diekhans, is this sufficient?

jeromekelleher avatar Mar 22 '16 15:03 jeromekelleher

  • Agreed that changing the API would make it easier to use. However I would rather code around it for now than wait on a multi-month discussion.
  • While not explicitly stated in any documentation, the server should be allowed to set limits on what it does. Implementing the API is different than scaling the API beyond the limits of the environment in which it is running.
  • The intended use case is to allow reading across read groups sets as a merge operation. This is very important and it was unanimously rejected when I proposed removing it of the last DWG. I think this can be done as a second step, but if this is a reference implementation, it has to implement it. It doesn't have to be highly efficient, especially with paging. I would be happy to wait until streaming to implement this.
  • The size of the page token would also be limited by the configured maximum read groups.

Internally, I would implement this as two paths as would be done with two endpoints. This would optimize the common case. We can propose an API chance in parallel.

The requirements for this part of the API are well considered, just never written down and most of the people involved are no longer contributing.

Danny Colligan [email protected] writes:

@jeromekelleher and I have been talking about this, but here are some of the things that have come up (I, of course, still want to hear his opinion in this thread).

Firstly, no other API endpoint we have has an array of ids as a mandatory argument (we have some that have arrays as optional arguments). This introduces some problems, mostly of specification. For instance: what happens if some of the ids are invalid, but others are valid? What if there are duplicate ids? Is there a limit on how many ids can be sent, or should we just keep accepting ids until the server runs out of memory? Etc.

Secondly, although the intended use case may be searching for reads within the same BAM, there is nothing stopping the client from sending readGroupIds that are contained in different BAMs. In that case, how should we order the reads that come back? If we do need to return them by position, that would imply sorting (that is, merging) them across different BAMs.

Thirdly, given our stateless architecture, we need to keep track of the readGroupIds that were originally sent since we have no way of storing them on the server. This may be a large list, which means they will need to be transmitted on every request and response in the pageToken (as well as any indicies which are bookkeeping for whatever algorithm we implement). That is a lot of overhead.

All of this seems to point in the direction of instead implementing a readGroupSetSearch endpoint that takes one readGroupSetId as a mandatory argument. In this way, we avoid all of the problems above since we would be at most searching through one BAM.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 22 '16 15:03 diekhans

Yes. that is the first and most important use case.

More complex case is doing the merge. Although it's claimed that htslib support this, so it could be showing this through pysam.

Danny Colligan [email protected] writes:

That only searches one BAM, not N BAMs. That would be similar to the code that we would have in a readGroupSetSearch endpoint.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 22 '16 16:03 diekhans

CH says google does implement it; but we can do this in in phases. Personally, I would wait till streaming for multi-bam.

Jerome Kelleher [email protected] writes:

Well, it looks like nobody actually implements multiple BAMs (Google don't apparently) so we may as well follow the crowd here. As an initial pass, how about we implement this as if it was passed a single ReadGroupSetId. That is, there are two legal inputs for the readGroupIds in the searchReads endpoint:

  1. Exactly one readGroupId, which we have implemented now; or
  2. k readgroup ids which correspond to exactly one ReadGroupSet. That is, all the readGroupIds belong to one ReadGroupSet, and all ReadGroups from that set are present. Therefore, we don't have to worry about filtering, and we don't have to worry about merging among an arbitrary number of BAMs. We might relax either of these assumptions in the future, or we might not.

I think the best thing would be to have two independent code paths to deal with this, so that we end up with something like:

if searchByReadGroup: intervalIterator = ReadsIntervalIterator(request, readGroup, reference) else: intervalIterator = ReadsIntervalIterator(request, readGroupSet, reference)

In the future, hopefully sanity will prevail and we will split these two things (search through ReadGroup and search through ReadGroupSet) into different endpoints. For now, hopefully this workaround will cover the majority of the use cases that people actually have.

@dcolligan, what do you think, is this implementable? @diekhans, is this sufficient?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

diekhans avatar Mar 22 '16 16:03 diekhans

Ok, it looks like we can compromise on @jeromekelleher 's plan

dcolligan avatar Mar 22 '16 16:03 dcolligan

Not according to @calbach @diekhans:

https://github.com/ga4gh/schemas/pull/570#issuecomment-194481531

The Google API's SearchReads accepts one or more read group set IDs and returns a paginated stream of reads which are ordered by position.

Note: We do not support searching by arbitrary readGroupIds from different read group sets (only from the same read group set) as we remain skeptical that this is a useful query. We may add support in the future - this would not be a significant amount of work, but it's not been requested.

So, Google allows search on a subset of the ReadGroups from a ReadGroupSet but does not support merge across ReadGroupSets.

I remain entirely unconvinced that merging across ReadGroups can't be don't more effectively and simply on the client side --- users will have to anyway if they want data from different servers. But this is besides the point: @dcolligan, lets go with the plan above if you're happy with it.

jeromekelleher avatar Mar 22 '16 16:03 jeromekelleher

Sorry if my reply was misleading. We absolutely support merge across multiple ReadGroupSets, when specified in the readGroupSetIds field (which is not part of the GA4GH spec).

calbach avatar Mar 22 '16 17:03 calbach

Interesting... can you supply a link to the docs for this endpoint please @calbach? Thanks for the clarification!

jeromekelleher avatar Mar 22 '16 18:03 jeromekelleher

My understanding from the GA4GH DWG call was that, as Mark says, merging results (sorted by position) across multiple readGroups is absolutely an important requirement. There was genuine push-back during that call on any attempt to adjust the API in the direction discussed above, so I don't think that's a real possibility going forward.

Which leaves us with ensuring the reference server does its job, namely, fulfills the requirements of the API: Doing this for multiple readGroups within a single readGroupSet should be relatively trivial for the reference server implementation (as Mark noted above), and the general use-case should certainly be doable, provided the claimed HTSLib support for this functionality exists in pysam (and even without such library-level support, merge-sorting multiple result sets on the fly is doable in the near term).

Also, I'm perhaps misunderstanding @dcolligan's earlier voiced concerns with the ReadGroupIDs in the request. As I understand it, passing in multiple ReadGroupID's per page request is exactly in keeping with the spirit of the API as it currently stands, and is directly analogous to passing in CallSetID's in the variants API. I believe this overhead would be trivially small relative to the volume of data returned, and would presumably be resolved altogether in a future streaming version of the API. Is this correct, or am I missing something important?

macieksmuga avatar Mar 22 '16 18:03 macieksmuga

@jeromekelleher sure: https://cloud.google.com/genomics/reference/rest/v1/reads/search

calbach avatar Mar 22 '16 21:03 calbach