ga4gh-schemas
ga4gh-schemas copied to clipboard
Result of querying a 0 length interval
If you have a read store holding the following records in it.
Read1: [1,4) Read2: [2,5) Read3: [4,7)
If you submit a query with the with start = 3, end = 3, which reads do you get back?
<--1-->
<--2-->
<--3-->
0 1 2 3 4 5 6 7 8 9
|
(edited diagram because I realized the line wasn't in the right spot)
With query start = 3 and end = 3 I think you had better get Read1 and Read2. Perhaps a more interesting question is what you get with start = 4 and end = 4. In that case I suggest just Read2?
Richard
On 30 Mar 2015, at 20:58, Louis Bergelson [email protected] wrote:
If you have a read store holding the following records in it.
Read1: [1,4) Read2: [2,5) Read3: [4,7)
If you submit a query with the with start = 3, end = 3, which reads do you get back?
<--1--> <--2--> <--3--> 0 1 2 3 4 5 6 7 8 9
|
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270.
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
A closed-open range [3,3)
is empty, so the query should return empty set.
A closed-open range [3,4)
is not empty and should return [Read1, Read2]
.
A closed-open range [4,4)
is empty, so the query should return empty set.
A closed-open range [4,5)
is not empty, so the query should return [Read3]
.
A closed-open range [6,7)
is not empty but does not overlap [4,7)
, so the query should return empty set.
I've found the Range class from Guava very useful for implementing such in java, e.g.
import com.google.common.collect.Range;
public final class isEmpty {
public static void main(final String[] args) {
Range<Integer> read1 = Range.closedOpen(1, 4);
Range<Integer> read2 = Range.closedOpen(2, 5);
Range<Integer> read3 = Range.closedOpen(4, 7);
Range<Integer> empty = Range.closedOpen(3, 3);
Range<Integer> notEmpty = Range.closedOpen(3, 4);
System.out.println("empty overlaps read1? " + !empty.intersection(read1).isEmpty());
System.out.println("empty overlaps read2? " + !empty.intersection(read2).isEmpty());
System.out.println("notEmpty overlaps read1? " + !notEmpty.intersection(read1).isEmpty());
System.out.println("notEmpty overlaps read2? " + !notEmpty.intersection(read2).isEmpty());
}
}
$ javac -classpath guava-18.0.jar isEmpty.java
$ java -classpath guava-18.0.jar:. isEmpty
empty overlaps read1? false
empty overlaps read2? false
notEmpty overlaps read1? true
notEmpty overlaps read2? true
Is guava range the interpretation that is being used for ga4gh? I have been told by a number of people that they interpret the range [3,3) as the point 3, not the empty set. If we interpret it as empty, than this query should return an error, rather than an empty query.
No, I believe that is an incorrect interpretation.
The definition of a closed-open range [a..b)
is {x | a <= x < b}
. For the discrete domain of integers (or longs) there are no values of x
that satisfy both a <= x
and x < a
.
An empty closed-open range is still a valid closed-open range (though it may not have any reasonable meaning in biology) so I feel that querying with one is valid and should return an empty result rather than an error.
It appears that there is more unresolved discussion about errors in #271.
It may be an incorrect interpretation according to that definition of closed-open intervals, but is the interpretation that a large number of people prefer. See @richarddurbin's comment above you.
Separately, accepting that [3,3) is empty, I don't believe that makes it a valid query. A query which is by definition going to return nothing, is not a query anyone would deliberately perform. Therefore it should be an error condition in order to help catch bugs.
If start:3, end:3
is interpreted by a lot of people as meaning 3
, then perhaps a lot of people aren't really thinking about closed-open intervals, but true closed intervals (how would these people interpret start:3, end:4
then?). So with all due apologies about asking a potentially stupid question, but why are we defining intervals as closed-open in the first place?
Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.
@macieksmuga The [3,3) would be a 0-length interval at the point 3. [3,4) has a length of 1. There are some people who feel strongly that we need a way of representing 0-length intervals on the reference. The use case is to represent insertions occurring between existing bases. I believe that bed format allows 0 length intervals, so anyone who wants to be able to handle bed files properly needs a mechanism for dealing with them.
Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.
You'd think so, until people demand that you can represent bed's 0 length intervals, and then you wind up with crazy things like [1,0] becoming a valid interval...
Yes. And in addition to representing 0-length intervals, closed-open intervals still concatenate well. For example, [1,3) and [3,8) concatenate nicely to form [1,8).
On Wed, Apr 8, 2015 at 12:40 PM, Louis Bergelson [email protected] wrote:
@macieksmuga https://github.com/macieksmuga The [3,3) would be a 0-length interval at the point 3. [3,4) has a length of 1. There are some people who feel strongly that we need a way of representing 0-length intervals on the reference. The use case is to represent insertions occurring between existing bases. I believe that bed format allows 0 length intervals, so anyone who wants to be able to handle bed files properly needs a mechanism for dealing with them.
Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.
You'd think so, until people demand that you can represent bed's 0 length intervals, and then you wind up with crazy things like [1,0] becoming a valid interval...
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91014645.
But they don't actually represent 0 length intervals if you use the definition that @heuermh provided.
It is best to have the capability of representing 0 length intervals, even if in some circumstances it is considered an error to use them. -D
On Wed, Apr 8, 2015 at 12:52 PM, Louis Bergelson [email protected] wrote:
But they don't actually represent 0 length intervals if you use the definition that @heuermh https://github.com/heuermh provided.
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91017227.
Thanks for clarifying the need for half-open intervals! On the other hand, I'm now puzzled by @lbergelson 's suggestion to disallow querying on empty intervals, which if I understood the above discussion correctly, are valid (and useful) entities:
Separately, accepting that [3,3) is empty, I don't believe that makes it a valid query. A query which is by definition going to return nothing, is not a query anyone would deliberately perform. Therefore it should be an error condition in order to help catch bugs.
Going by the definition @heuermh provided for half open intervals, [a,a) == [b, b) even if a != b. And [a,a) intersect anything is also empty. Half open intervals only describe a meaningful 0 length interval if we have a different definition of what it means to be a half open interval.
No. A subsequence is not just a set. The definition given by @heuermh just specified the characters to include. @heuermh did not say that [a,a) = [b,b) if a!= b. As in my previous post, a 0-length subsequence at different positions is different. For any subsequence, the start and the length are important. [a,b) starts at a and has length (b-a). This is clearly true if b>a and also true if b=a.
On 9 Apr 2015, at 18:41, Louis Bergelson [email protected] wrote:
Going by the definition @heuermh https://github.com/heuermh provided for half open intervals, [a,a) == [b, b) even if a != b. And [a,a) intersect anything is also empty. Half open intervals only describe a meaningful 0 length interval if we have a different definition of what it means to be a half open interval.
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91304582.
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.