ga4gh-schemas icon indicating copy to clipboard operation
ga4gh-schemas copied to clipboard

Queries off the end of a reference

Open lbergelson opened this issue 9 years ago • 26 comments

If a query is performed that requests reads from an area larger than the reference, what is the result?

i.e. Reference is 1000 bp long. Query starts at 900, ends at 1500. Are all the reads overlapping 900->1000 returned? Or is it an error?

lbergelson avatar Mar 30 '15 20:03 lbergelson

I think an error.

On 30 Mar 2015, at 21:01, Louis Bergelson [email protected] wrote:

If a query is performed that requests reads from an area larger than the reference, what is the result?

i.e. Reference is 1000 bp long. Query starts at 900, ends at 1500. Are all the reads overlapping 900->1000 returned? Or is it an error?

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/271.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Mar 30 '15 22:03 richarddurbin

An error sounds reasonable --- should we update the schema to formalise the behaviour in these types of condition?

jeromekelleher avatar Mar 31 '15 08:03 jeromekelleher

I think the schema should be very explicit about what happens with conditions like this. I don't have a preference in this case about what the answer is, as long as it is well defined and doesn't vary from implementation to implementation.

lbergelson avatar Mar 31 '15 14:03 lbergelson

I couldn't agree more. However, there is really no discussion at all of error conditions in the schema right now, so it's not obvious how it should be done. Have you any thoughts on this?

jeromekelleher avatar Mar 31 '15 14:03 jeromekelleher

If you trigger an error when any part of the query is out of bounds, then the client needs to be very careful to maintain a list of references and lengths when making queries. Not a bad idea, but sometimes this is impractical or unnecessary when the edge cases are very infrequent.

My preference would be to not return anything when the query is wholly out of bounds. It's simpler, which means less to implement. The sanity checking can be left up to the client, which can elect to do so depending on application.

In response to the original prompt, this would mean that the reads in the reference range get returned, even if part of the range isn't in the reference proper.

ekg avatar Mar 31 '15 16:03 ekg

I think returning an error is the right thing. Returning nothing is too ambiguous. Having queries that are out of bounds (or inconsistent with the assembly structure- i.e. in gaps) tends to happen when you have the wrong reference. I see this happen more often than I would like- a group thinks they are on GRCh37 but do their analysis on GRCh38 (or vise versa). Explicit errors for error stats would be my preference- it is safer and there are less assumptions made.

deannachurch avatar Mar 31 '15 19:03 deannachurch

Yes, +1 for Deanna's view. No result returned is a meaningful result in many bioinformatics related queries. This is not twitter; precision should be an important goal of any query.

Deanna Church [email protected] writes:

I think returning an error is the right thing. Returning nothing is too ambiguous. Having queries that are out of bounds (or inconsistent with the assembly structure- i.e. in gaps) tends to happen when you have the wrong reference. I see this happen more often than I would like- a group thinks they are on GRCh37 but do their analysis on GRCh38 (or vise versa). Explicit errors for error stats would be my preference- it is safer and there are less assumptions made.

— Reply to this email directly or view it on GitHub.*

diekhans avatar Mar 31 '15 21:03 diekhans

Note for later: Tweet your query to @ga4ghapi...

On 31 Mar 2015, at 23:08, Mark Diekhans [email protected] wrote:

Yes, +1 for Deanna's view. No result returned is a meaningful result in many bioinformatics related queries. This is not twitter; precision should be an important goal of any query.

Deanna Church [email protected] writes:

I think returning an error is the right thing. Returning nothing is too ambiguous. Having queries that are out of bounds (or inconsistent with the assembly structure- i.e. in gaps) tends to happen when you have the wrong reference. I see this happen more often than I would like- a group thinks they are on GRCh37 but do their analysis on GRCh38 (or vise versa). Explicit errors for error stats would be my preference- it is safer and there are less assumptions made.

— Reply to this email directly or view it on GitHub.*

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/271#issuecomment-88249049.

mbaudis avatar Mar 31 '15 21:03 mbaudis

Errors are ideal, but there is an economy to features of the API. Each new error type will need to be implemented by servers and handled by clients.

It's also not clear to me that this will help clients avoid problems with mismatched references and data unless they are working close to chromosome boundaries or there are large differences in chromosome lengths between versions. There will be many other signals, such as mismatched bases in the reference, that are more likely to be detected by users and less easy to guard against by protective error returns. We are not going to be able to avoid user confusion by design.

So I agree that errors are probably the most correct thing to do. But, features have costs, and until this is a non-hypothetical issue I'd prefer to leave it out of the specification.

As a non-hypothetical example, samtools faidx will return an empty sequence when it is targeted outside of bounds. However, it returns an error when the chromosome name does not exist in the FASTA index. Similarly, samtools view doesn't return an error when it's given a range completely outside of a chromosome (it just returns nothing) and also returns an error when the chromosome doesn't exist. This seems to have been an acceptable API to users. I'd be curious what @lh3 thinks.

ekg avatar Apr 01 '15 10:04 ekg

Well defined error conditions are critical for an API in my view. I don't have a strong opinion about this particular situation, but in general we should define the behaviour of the server in all situations. If we don't, then it's up to server implementations whether they return an error or no results, and then client code ends up being written against the behaviour of a particular server. Suppose I develop my code on server X, that happens to return no results in the out-of-bounds condition here. As a client side developer, I implicitly assume that this is the behaviour mandated by the protocol, so I write my code accordingly and build this assumption in to my code. Some time later, I need to scale my code so I run it on server Y, which happens to throw errors. Now my code doesn't work any more. If we want portability of client code across different servers, we must specify the behaviour of the server in every possible condition.

Well defined error conditions are an important part of this. I don't think it's a major burden to implement these once they are specified. As a server developer, I would much rather have a cast iron spec to develop towards than to have to try and guess what the correct behaviour is (see ga4gh/server#308 for an example of where wooly specification is causing us problems).

So, the question for me is, how do we define these error conditions, their results, and ensure that they are thrown in a verifiable manner?

jeromekelleher avatar Apr 01 '15 11:04 jeromekelleher

I appreciate the difficulties of implementing error conditions, and having them be consistent. I do think it is important to identify states that are truly errors (running off the end of the chromosome qualifies I believe) and define how to handle these. I would also assert that just because you've heard no complaints about how an API is implemented does not mean there is not a problem. It may in fact be OK, or people could just be coding around it (or missing it) without complaining/reporting a problem.

deannachurch avatar Apr 01 '15 15:04 deannachurch

Erik Garrison [email protected] writes:

Errors are ideal, but there is an economy to features of the API. Each new error type will need to be implemented by servers and handled by clients.

Absolutely, features do have costs, but accurate exchange of data is important and we have to clearly define and implement the API. Cost is in the server and conformance suites. Clients general just handle most errors the same.

While checking for out-of-bounds coordinates isn't the most powerful or predictable sanity check, it has caught me many times. Admittedly, I am biased towards doing whole genome analysis.

Bioinformatics data has a huge problem of being imprecisely engineered. Yes, the data is noisy, but how the data is stored should not be noisy. Trying to understand many of our data sets, even some of the better ones, is often a exercise in forensics.

Having a very precisely define API might create more work for the specification and server developers, but it simplifies the whole systems. One knows what to expect, not having to guess and code around different responses and behavior.

Of course, none of this is perfect and it takes time. GA4GH should do things as exact as possible. Save time by delaying features rather than cutting corners.

Mark

diekhans avatar Apr 01 '15 16:04 diekhans

It seems that the consensus on this issue is that we should have an error. The question now is, how do we specify this? Some options:

  1. Describe the error conditions in the comments, and note that an error should be thrown in these conditions. Not further specification given.
  2. Describe the error conditions in the comments, and specify the error code that should be filled in to the GAException.errorCode for each condition. We must then define the error codes and their meanings somewhere in the specification.
  3. Describe the error conditions in the comments, and specify an error descriptor from an enum that we define. We would need to add a new field to the GAException type to allow us to send this error descriptor to the client.
  4. Describe the error conditions in the comments, and specify a different Exception type for each error that can occur. Formally specify that the method throws these types of exceptions. (Does Avro allow us to throw multiple exceptions from a method?)

Thoughts?

jeromekelleher avatar Apr 02 '15 09:04 jeromekelleher

I would go for 2 or 3. 4 looks significantly heavier to me for little if no gain.

Apologies if I am naive, but is there a possibility for a registered error handler to reject an error, passing it back to the thrower? Then the throwing code could do something like

if (out of bounds && throwErrror (OutOfBoundsError)) { return empty results ; }

and Erik could get his preferred behaviour by registering a handler that rejects the error.

I guess only some errors could be returned - there may be some which are unrecoverable. Maybe none of this makes any sense in a server/client environment.

Richard

On 2 Apr 2015, at 10:05, Jerome Kelleher [email protected] wrote:

It seems that the consensus on this issue is that we should have an error. The question now is, how do we specify this? Some options:

Describe the error conditions in the comments, and note that an error should be thrown in these conditions. Not further specification given. Describe the error conditions in the comments, and specify the error code that should be filled in to the GAException.errorCode for each condition. We must then define the error codes and their meanings somewhere in the specification. Describe the error conditions in the comments, and specify an error descriptor from an enum that we define. We would need to add a new field to the GAException type to allow us to send this error descriptor to the client. Describe the error conditions in the comments, and specify a different Exception type for each error that can occur. Formally specify that the method throws these types of exceptions. (Does Avro allow us to throw multiple exceptions from a method?) Thoughts?

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/271#issuecomment-88835928.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 02 '15 09:04 richarddurbin

I like (2) & (3) best also. It would be a lot of Avro plumbing to do (4), for not a lot of gain.

A client library can do the sort of things that you are talking about here @richarddurbin. I think it would be a lot of complexity in the server to make errors optionally recoverable, for no real advantage. I'm not really convinced that it's ever a good idea to mask errors from the application programmer though.

jeromekelleher avatar Apr 02 '15 12:04 jeromekelleher

I initially didn't have a strong opinion on the result of one of these queries, but after thinking about it I'm more of the opinion that a query partially off the reference should return all the overlapping reads if any.

I have the following example of why this might be desirable behavior.
It's very reasonable that someone would want to take every read in some set of reads, and query for reads that overlap that read. In the case of a read that is aligned off the end of the contig, with the current the proposal, this query would produce an error. It seems unexpected that querying the start/end position of an existing read would produce an error.

Sorry for jumping in late.

lbergelson avatar Apr 02 '15 14:04 lbergelson

@lbergelson this is exactly the kind of situation I'm referring to as presenting an additional difficulty for clients. If we implement this specific error, all clients will need to maintain complete maps of the entire reference system. Perhaps this is desirable, but it increases the coordination cost of the distributed system.

ekg avatar Apr 02 '15 14:04 ekg

@ekg Yes, it definitely is. I just hadn't come up with a realistic situation when you would end up with queries that look like that. Since you can handle queries from start -> end of chromosome by just not specifying an end, most of these issues are just avoided. Any time you're generating positions externally it seems reasonable to have to sanity check them, but this is a situation where querying the results of a previous query in a straight forward way could lead to an error.

lbergelson avatar Apr 02 '15 14:04 lbergelson

Having very clearly defined and expressed errors is really important for writing robust code.

We should absolute avoid any special cases, like parameters that control if an error is returned or not. Every if-then-else makes the API more complex.

Bioinformatics people can learn to be precise programmers. It's far easier in the long run to not have so many possible failure cases.

We need to concentrate on making the API straight-forward and write language specific, client libraries to make programming easier.

diekhans avatar Apr 02 '15 14:04 diekhans

I don't think it's quite as bad as you are making out @ekg. There are two options to cover this case cleanly:

  1. The client needs to make a request to ask about the size of the reference before hand. (Clients don't need to know anything).
  2. We make end nullable, and make this mean 'until the end of the reference'.

Using (2) clients can conveniently perform the required queries you mention. If they explicitly provide a value that is too large we consider this to be an error and return one.

update

Sorry, I should have read @lbergelson's example more carefully. This is quite a subtle corner case, and not that easily handled with a null end coordinate.

jeromekelleher avatar Apr 02 '15 15:04 jeromekelleher

Jerome Kelleher [email protected] writes:

  1. We make end nullable, and make this mean 'until the end of the reference'.

This is clear and precise. Making up a large number isn't.

diekhans avatar Apr 02 '15 16:04 diekhans

End is already nullable with that behavior according to the spec.

lbergelson avatar Apr 02 '15 17:04 lbergelson

Yes, sorry @lbergelson, I was thinking about variants. The question is though, is this corner case worth losing the error handling assurances? Is it more common for people to make mistakes that could be caught early by having the server do bounds checking, or to need to tackle the situation you mention above?

jeromekelleher avatar Apr 02 '15 17:04 jeromekelleher

@jeromekelleher Oh, I didn't realize it was different for variants. Whatever the decision is, it should be consistent across the different query api's.

I'm not certain which is more important. I'm currently leaning towards the convenience of not having to explicitly clip to the end position in every query to the known reference. Of course this could be done automatically by the client library, but in that case you also lose the error checking.

I'm not sure what types of errors will be caught by returning an error. It seems likely it will mostly be off by 1 errors, or possibly people accidentally using the wrong reference. Incorrect reference errors seem like they will be already implicitly handled because queries must specify the reference. Maybe this could catch users typing the wrong reference set name, but that seems like a low priority issue and probably this isn't the right way to deal with it. (better to encode it in the analysis somehow instead of typing things manually...)

lbergelson avatar Apr 02 '15 17:04 lbergelson

I agree @lbergelson --- it's not obvious what the right answer is here!

jeromekelleher avatar Apr 02 '15 17:04 jeromekelleher

I think it practice it is likely that we would just clip the end position of every query to avoid an error. So that would defeat the point of having the error.

lbergelson avatar Apr 02 '15 17:04 lbergelson