GenomicFeatures.jl icon indicating copy to clipboard operation
GenomicFeatures.jl copied to clipboard

can you add a function merge?

Open panxiaoguang opened this issue 3 years ago • 6 comments

Although it's a small problem, merging adjacent intervals still requires a lot of code. Can you provide such a function

panxiaoguang avatar Apr 20 '21 00:04 panxiaoguang

@panxiaoguang it might be helpful to provide a MWE of code that gives the sorts of inputs you're talking about, and then a few examples of what you'd like this function to do (how you're doing it currently, and what you think a good interface would be)

kescobo avatar Apr 26 '21 19:04 kescobo

Below is a generic example of how I would begin to approach partitioning a collection of intervals, and the merging of partitions.

using GenomicFeatures

intervals = [Interval.("chr1", 1:2:10, 2:2:10); Interval.("chr2", 1:2:10, 2:2:10)]

col = IntervalCollection(intervals)

function custom_merge(interval_a::Interval{Nothing}, interval_b::Interval{Nothing})
    return Interval(seqname(interval_a), leftposition(interval_a), rightposition(interval_b))
end

function custom_merge(interval::Interval{Nothing})
    return interval
end

function custom_merge(intervals::Vector)
    return custom_merge(intervals...)
end

function custom_partitioner(tree::GenomicFeatures.IntervalTrees.IntervalTree, p::Integer=2)
    return Iterators.partition(tree, p)
end

function process(f::Function, col::IntervalCollection)

    # Generate iterators for each tree.
    iterators = Iterators.Generator(f, values(col.trees))

    return Iterators.flatten(iterators)
end

merged = map(custom_merge, process(custom_partitioner, col))
merged = custom_merge.(process(custom_partitioner, col))

CiaranOMara avatar Jun 12 '21 16:06 CiaranOMara

return Interval(seqname(interval_a), leftposition(interval_a), rightposition(interval_b))

how about a contains b ? then merged interval should be a

panxiaoguang avatar Jun 17 '21 02:06 panxiaoguang

If GenomicFeatures.jl can have all features of bedtools, I think it will be a great tools for every bioinformatics people

panxiaoguang avatar Jun 17 '21 02:06 panxiaoguang

I would welcome this feature. It would also be nice to see other interval set operations such as differences and intersections. I would probably write some implementations, but I am not sure on how to do these operations efficiently on large datasets. Is there a set of possible test cases to work with for these tasks?

lysogeny avatar Jun 26 '21 15:06 lysogeny

The merge methods would need to be generic. I don't see merge methods being added as their requirements are far too variable. But, contains or intersects methods could be made available. Test cases for such methods would need designing/sourcing and writing. I would not initially be concerned about speed or efficiency, those may be addressed later. GenomicFeatures has some implemented benchmarks, so you could add to those to evaluate and track what is more or less efficient.

The following is an example of a custom merge method where intervals have metadata.

function default_merge_checks(interval_a, interval_b)
    seqname(interval_a) == seqname(interval_b) || error("Intervals do not have the same seqname.")
    strand(interval_a) == strand(interval_b) || error("Intervals do not have the same strand.")
    interval_a < interval_b || error("Intervals are not in order.")
end

function custom_merge(f::Function, interval_a::I, interval_b::I; check::Function = default_merge_checks) where {T, I<:AbstractInterval{T}}

    check(interval_a, interval_b)

    return I(
        seqname(interval_a),
        leftposition(interval_a),
        rightposition(interval_b),
        strand(interval_a),
        f(metadata(interval_a, metadata(interval_b))
    )

end
interval = custom_merge(+, interval_a, interval_b)

CiaranOMara avatar Aug 08 '21 17:08 CiaranOMara