GenomicFeatures.jl
GenomicFeatures.jl copied to clipboard
Ordering of genomic intervals only works as expected when parameterised by the same type
Comparison of intervals isless
, isordered
, precedes
of intervals is defined for intervals with identical type parameters for their metadata, for example:
Base.isless(a::Interval{T}, b::Interval{T}, seqname_isless::Function=isless) where T
This causes some surprises when comparing intervals with different parametric types as it falls back on basic_isless
defined in IntervalTrees.jl
and does not compare seqname
:
## Comparing Interval{String} with Interval{String}
julia> Interval("chr2", 1, 10, '.', "") < Interval("chr1", 11, 20, '.', "")
false # expected
julia> @which Interval("chr2", 1, 10, '.', "") < Interval("chr1", 11, 20, '.', "")
isless(a::Interval{T}, b::Interval{T}) where T in GenomicFeatures at ... .julia/packages/GenomicFeatures/bVCwr/src/interval.jl:82
## Comparing Interval{Int64} with Interval{String}
julia> Interval("chr2", 1, 10, '.', 1) < Interval("chr1", 11, 20, '.', "")
true # not expected, only compares first and last
julia> @which isless(Interval("chr2", 1, 10, '.', 1), Interval("chr1", 11, 20, '.', ""))
isless(u::IntervalTrees.AbstractInterval, v::IntervalTrees.AbstractInterval) in IntervalTrees at ... .julia/packages/IntervalTrees/wh2ex/src/IntervalTrees.jl:30
It would be super useful to be able to order intervals when their metadata was of different types. The most obvious change would be to allow different type parameters:
Base.isless(a::Interval{T}, b::Interval{V}, seqname_isless::Function=isless) where {T, V}
However, could this cause issues breaking ordering equality conditions, as it makes sense to define isequal for two intervals that have equal metadata. So for two intervals x
and y
with identical coordinates but different metadata you would have all of isless(x, y)
, isless(y, x)
and isequal(x, y)
being false.
My use case is that I want to calculate distances between intervals with different metadata and using functions like sort
and searchsorted
are useful in doing this.
Would be happy if isordered
was changed to isordered(a::Interval{T}, b::Interval{V}, seqname_isless::Function=isless) where {T, V}
, and then use: searchsorted(intervals, iv, lt=GenomicFeatures.isordered)
, unless anyone sees any issues in doing so?
I think I've got upstream code that addresses your use case. But let me know if it doesn't and I'll have another look at your suggestion. I've linked the relevant code below.
https://github.com/BioJulia/GenomicFeatures.jl/blob/2bad5693c4e8c09c8c5c5e8c92f7c6005c6567a1/src/interval.jl#L86-L137
If you want to test the code, add the following packages.
add https://github.com/CiaranOMara/IntervalTrees.jl#pu/v1.1.0
add GenomicFeatuers#pu/v3.0.0
The main changes are that the Interval
type is renamed GenomicInterval
and now subtypes AbstractGenomicInterval
; and IntervalCollection
is now GenomicIntervalCollection
and the element type of the collection is now the AbstractGenomicInterval
instead of the interval's metadata type. The rest should be familiar.
These branches are fairly stable as I'm using them for my own work. The reason they haven't been merged into master is that it doesn't seem correct/reasonable to run ahead with GenomicFeatures v3 before bring all of the other packages up to date with GenomicFeatures v2.
I've moved the relevant code down to the head of the develop branch. So the functionality you require without the new types is now also an option for you with dev GenomicFeatures
.