ga4gh-schemas icon indicating copy to clipboard operation
ga4gh-schemas copied to clipboard

Result of querying a 0 length interval

Open lbergelson opened this issue 9 years ago • 13 comments

If you have a read store holding the following records in it.

Read1: [1,4) Read2: [2,5) Read3: [4,7)

If you submit a query with the with start = 3, end = 3, which reads do you get back?

   <--1-->
      <--2-->
            <--3-->
0  1  2  3  4  5  6  7  8  9  
         |    

(edited diagram because I realized the line wasn't in the right spot)

lbergelson avatar Mar 30 '15 19:03 lbergelson

With query start = 3 and end = 3 I think you had better get Read1 and Read2. Perhaps a more interesting question is what you get with start = 4 and end = 4. In that case I suggest just Read2?

Richard

On 30 Mar 2015, at 20:58, Louis Bergelson [email protected] wrote:

If you have a read store holding the following records in it.

Read1: [1,4) Read2: [2,5) Read3: [4,7)

If you submit a query with the with start = 3, end = 3, which reads do you get back?

<--1--> <--2--> <--3--> 0 1 2 3 4 5 6 7 8 9
|
— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Mar 30 '15 22:03 richarddurbin

A closed-open range [3,3) is empty, so the query should return empty set. A closed-open range [3,4) is not empty and should return [Read1, Read2]. A closed-open range [4,4) is empty, so the query should return empty set. A closed-open range [4,5) is not empty, so the query should return [Read3]. A closed-open range [6,7) is not empty but does not overlap [4,7), so the query should return empty set.

I've found the Range class from Guava very useful for implementing such in java, e.g.

import com.google.common.collect.Range;

public final class isEmpty {
    public static void main(final String[] args) {
        Range<Integer> read1 = Range.closedOpen(1, 4);
        Range<Integer> read2 = Range.closedOpen(2, 5);
        Range<Integer> read3 = Range.closedOpen(4, 7);

        Range<Integer> empty = Range.closedOpen(3, 3);
        Range<Integer> notEmpty = Range.closedOpen(3, 4);

        System.out.println("empty overlaps read1? " + !empty.intersection(read1).isEmpty());
        System.out.println("empty overlaps read2? " + !empty.intersection(read2).isEmpty());
        System.out.println("notEmpty overlaps read1? " + !notEmpty.intersection(read1).isEmpty());
        System.out.println("notEmpty overlaps read2? " + !notEmpty.intersection(read2).isEmpty());
    }
}
$ javac -classpath guava-18.0.jar isEmpty.java
$ java -classpath guava-18.0.jar:. isEmpty
empty overlaps read1? false
empty overlaps read2? false
notEmpty overlaps read1? true
notEmpty overlaps read2? true

heuermh avatar Apr 08 '15 16:04 heuermh

Is guava range the interpretation that is being used for ga4gh? I have been told by a number of people that they interpret the range [3,3) as the point 3, not the empty set. If we interpret it as empty, than this query should return an error, rather than an empty query.

lbergelson avatar Apr 08 '15 17:04 lbergelson

No, I believe that is an incorrect interpretation.

The definition of a closed-open range [a..b) is {x | a <= x < b}. For the discrete domain of integers (or longs) there are no values of x that satisfy both a <= x and x < a.

An empty closed-open range is still a valid closed-open range (though it may not have any reasonable meaning in biology) so I feel that querying with one is valid and should return an empty result rather than an error.

It appears that there is more unresolved discussion about errors in #271.

heuermh avatar Apr 08 '15 18:04 heuermh

It may be an incorrect interpretation according to that definition of closed-open intervals, but is the interpretation that a large number of people prefer. See @richarddurbin's comment above you.

Separately, accepting that [3,3) is empty, I don't believe that makes it a valid query. A query which is by definition going to return nothing, is not a query anyone would deliberately perform. Therefore it should be an error condition in order to help catch bugs.

lbergelson avatar Apr 08 '15 18:04 lbergelson

If start:3, end:3 is interpreted by a lot of people as meaning 3, then perhaps a lot of people aren't really thinking about closed-open intervals, but true closed intervals (how would these people interpret start:3, end:4 then?). So with all due apologies about asking a potentially stupid question, but why are we defining intervals as closed-open in the first place?

Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.

macieksmuga avatar Apr 08 '15 18:04 macieksmuga

@macieksmuga The [3,3) would be a 0-length interval at the point 3. [3,4) has a length of 1. There are some people who feel strongly that we need a way of representing 0-length intervals on the reference. The use case is to represent insertions occurring between existing bases. I believe that bed format allows 0 length intervals, so anyone who wants to be able to handle bed files properly needs a mechanism for dealing with them.

Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.

You'd think so, until people demand that you can represent bed's 0 length intervals, and then you wind up with crazy things like [1,0] becoming a valid interval...

lbergelson avatar Apr 08 '15 19:04 lbergelson

Yes. And in addition to representing 0-length intervals, closed-open intervals still concatenate well. For example, [1,3) and [3,8) concatenate nicely to form [1,8).

On Wed, Apr 8, 2015 at 12:40 PM, Louis Bergelson [email protected] wrote:

@macieksmuga https://github.com/macieksmuga The [3,3) would be a 0-length interval at the point 3. [3,4) has a length of 1. There are some people who feel strongly that we need a way of representing 0-length intervals on the reference. The use case is to represent insertions occurring between existing bases. I believe that bed format allows 0 length intervals, so anyone who wants to be able to handle bed files properly needs a mechanism for dealing with them.

Defining all intervals as closed would also trivially answer the concern about requesting empty intervals: The only way that could happen is if the user specified end < start, and there's a pretty clear argument for throwing an error in that case.

You'd think so, until people demand that you can represent bed's 0 length intervals, and then you wind up with crazy things like [1,0] becoming a valid interval...

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91014645.

haussler avatar Apr 08 '15 19:04 haussler

But they don't actually represent 0 length intervals if you use the definition that @heuermh provided.

lbergelson avatar Apr 08 '15 19:04 lbergelson

It is best to have the capability of representing 0 length intervals, even if in some circumstances it is considered an error to use them. -D

On Wed, Apr 8, 2015 at 12:52 PM, Louis Bergelson [email protected] wrote:

But they don't actually represent 0 length intervals if you use the definition that @heuermh https://github.com/heuermh provided.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91017227.

haussler avatar Apr 08 '15 19:04 haussler

Thanks for clarifying the need for half-open intervals! On the other hand, I'm now puzzled by @lbergelson 's suggestion to disallow querying on empty intervals, which if I understood the above discussion correctly, are valid (and useful) entities:

Separately, accepting that [3,3) is empty, I don't believe that makes it a valid query. A query which is by definition going to return nothing, is not a query anyone would deliberately perform. Therefore it should be an error condition in order to help catch bugs.

macieksmuga avatar Apr 09 '15 17:04 macieksmuga

Going by the definition @heuermh provided for half open intervals, [a,a) == [b, b) even if a != b. And [a,a) intersect anything is also empty. Half open intervals only describe a meaningful 0 length interval if we have a different definition of what it means to be a half open interval.

lbergelson avatar Apr 09 '15 17:04 lbergelson

No. A subsequence is not just a set. The definition given by @heuermh just specified the characters to include. @heuermh did not say that [a,a) = [b,b) if a!= b. As in my previous post, a 0-length subsequence at different positions is different. For any subsequence, the start and the length are important. [a,b) starts at a and has length (b-a). This is clearly true if b>a and also true if b=a.

On 9 Apr 2015, at 18:41, Louis Bergelson [email protected] wrote:

Going by the definition @heuermh https://github.com/heuermh provided for half open intervals, [a,a) == [b, b) even if a != b. And [a,a) intersect anything is also empty. Half open intervals only describe a meaningful 0 length interval if we have a different definition of what it means to be a half open interval.

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/schemas/issues/270#issuecomment-91304582.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

richarddurbin avatar Apr 09 '15 21:04 richarddurbin