pyranges icon indicating copy to clipboard operation
pyranges copied to clipboard

Inconsistent handling of bookended intervals

Open marco-mariotti opened this issue 2 years ago • 3 comments

There are inconsistencies in how to treat bookended intervals ( = adjacent intervals = the Start of one is the End of the other) among methods dealing with overlap.

Make some intervals: p1 (2 intervals with names a, b), p2 (one interval with name c, bookended with b), p3 (concatenation of p1, p2)

p1 = pr.from_dict({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 3], 'End': [2, 6],
                   'Name': ['a', 'b']})
p2 = pr.from_dict({'Chromosome': ['chr1'], 'Start': [6], 'End': [7],
                   'Name': ['c']})
p3 = pr.concat([p1, p2])
p3

+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |
| (category)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         1 |         2 | a          |
| chr1         |         3 |         6 | b          |
| chr1         |         6 |         7 | c          |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

The overlap method does not consider bookended as overlapping. This is, IMHO, the "reasonable" default behavior.

p1.overlap(p2)

Empty PyRanges

The same for intersect, set_intersect, set_union. (Note: none of these methods accept the slack argument.)

Methods cluster, merge, and max_disjoint on the other hand, consider bookended as overlapping by default:

p3.cluster()
+--------------+-----------+-----------+------------+-----------+
| Chromosome   |     Start |       End | Name       |   Cluster |
| (category)   |   (int32) |   (int32) | (object)   |   (int32) |
|--------------+-----------+-----------+------------+-----------|
| chr1         |         1 |         2 | a          |         1 |
| chr1         |         3 |         6 | b          |         2 |
| chr1         |         6 |         7 | c          |         2 |
+--------------+-----------+-----------+------------+-----------+
Unstranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

p3.merge()
+--------------+-----------+-----------+
| Chromosome   |     Start |       End |
| (category)   |   (int32) |   (int32) |
|--------------+-----------+-----------|
| chr1         |         1 |         2 |
| chr1         |         3 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

p3.max_disjoint()
+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |
| (category)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         1 |         2 | a          |
| chr1         |         3 |         6 | b          |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 2 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

In these three methods (cluster, merge, and max_disjoint), the "reasonable" behavior can be invoked with slack=-1:

p3.cluster(slack=-1)
+--------------+-----------+-----------+------------+-----------+
| Chromosome   |     Start |       End | Name       |   Cluster |
| (category)   |   (int32) |   (int32) | (object)   |   (int32) |
|--------------+-----------+-----------+------------+-----------|
| chr1         |         1 |         2 | a          |         1 |
| chr1         |         3 |         6 | b          |         2 |
| chr1         |         6 |         7 | c          |         3 |
+--------------+-----------+-----------+------------+-----------+
Unstranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

p3.merge(slack=-1)
+--------------+-----------+-----------+
| Chromosome   |     Start |       End |
| (category)   |   (int32) |   (int32) |
|--------------+-----------+-----------|
| chr1         |         1 |         2 |
| chr1         |         3 |         6 |
| chr1         |         6 |         7 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 3 rows and 2 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

p3.max_disjoint(slack=-1)
+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |
| (category)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         1 |         2 | a          |
| chr1         |         3 |         6 | b          |
| chr1         |         6 |         7 | c          |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 3 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

On the other hand, method join has the reasonable behavior by default:

p1.join(p2)

Empty PyRanges

But its slack argument has a different interpretation. To include overlaps between bookended intervals, you must use slack=1:

p1.join(p2, slack=1)
+--------------+-----------+-----------+------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | Name       |   Start_b |     End_b | Name_b     |
| (category)   |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
|--------------+-----------+-----------+------------+-----------+-----------+------------|
| chr1         |         3 |         6 | b          |         6 |         7 | c          |
+--------------+-----------+-----------+------------+-----------+-----------+------------+
Unstranded PyRanges object has 1 rows and 7 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

The request is to:

  • have an homogeneous interpretation of "slack" in join vs other methods that accept it. IMHO join has the correct interpretation.
  • make the default behaviour of all methods mentioned above to not treat bookended as overlapping (i.e. if present, slack defaults to 0, behaves like in current join)
  • at some point, add an optional slack argument also to overlap, intersect, set_intersect, set_union

(Kudos to @JoanPAAL in my lab for noticing and writing the explanation)

marco-mariotti avatar Mar 09 '23 11:03 marco-mariotti

intersect, set_intersect, set_union, and join are binary operations.

cluster, merge, and max_disjoint are unary.

It is not that weird (to me, at least) that the first group requires overlap and the second requires adjacency.

This is similar to what bedtools does:

head test*bed                                                                                                                                  (py3_11_3)
==> test.bed <==
chr1	6	7

==> test2.bed <==
chr1	5	6

cat test2.bed test.bed > test_both.bed

bedtools merge -i test_both.bed                                                                                                                (py3_11_3)
chr1	5	7

bedtools intersect -wo -a test.bed -b test2.bed # same as pyranges.join
# no result

Both to change the API or use a different one than the most well-known tool in the field is likely to cause confusion.

However, when it comes to the behavior of slack, I think it should be the same within the unary and binary groups.

endrebak avatar May 01 '23 11:05 endrebak

I'm OK with adding slack to the other binary methods.

I think it is as simple as:

  1. extend the first pyranges in the comparison with <slack>
  2. perform the operation
  3. remove the slack from the output so that the results are like what went in.

Are one of you up for adding it?

The join method shows how to do it. I think all the functions should reuse the same add_temp_slack/remove_temp_slack methods though.

endrebak avatar May 01 '23 11:05 endrebak

OK, so let's keep the meaning of slack as in bedtools. @JoanPAAL will add convenient warnings to the documentation of the methods to clarify whether bookended intervals are merged/considered overlaps etc

marco-mariotti avatar May 16 '23 09:05 marco-mariotti