plyranges
plyranges copied to clipboard
[BUG] Inconsistent behaviour of join_nearest_downstream() with one unstranded range set?
I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().
I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).
I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).
Is this the correct way of doing it ?
The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).
I am working with Arabidopsis data, so as genes I am using these data:
ath_coding_genes <- read_gff("./Arabidopsis_thaliana.TAIR10.57.gff3.gz") %>%
filter(type == "gene")
Then I try to find the nearest gene that is downstream of my regions:
my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)
This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:
test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"
> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 1
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
[1] 1 19696923-19698482 - | AT1G52890
-------
seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
Both regions are 120 nt apart:
> reg1 <- my_regions %>% filter(name == test_peak)
> reg2 <- ath_coding_genes %>% filter(gene_id == test_gene)
> distance(reg1, reg2)
[1] 120
If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.
> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
[2] 1 19698483-19698583 * | Merged-1-19698533-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
For reference, I have written a function using basic GRanges that does what I expected (find which genes are downstream or overlapping a set of regions).
If the observerd behaviour of join_nearest_downstream() is expected, I would modify the documentation and/or include a warning message if using unstranded GRanges.
find_preceding = function(peak_regions, gene_granges) {
# First we find the genes that have an overlapping region in their coding sequence:
overlapping_hits <- findOverlaps(peak_regions, gene_granges)
overlapping_regions <- peak_regions[queryHits(overlapping_hits)]
overlapping_genes <- gene_granges[subjectHits(overlapping_hits)]
overlapping_gene_ids <- overlapping_genes$gene_id
overlapping_granges <- overlapping_regions %>%
mutate(gene_id = overlapping_gene_ids,
dist = distance(overlapping_regions, overlapping_genes)) %>%
mutate(name = paste0(name, "_", gene_id))
# Then we remove the overlapping regions and search for regions preceding genes
non_overlapping_regions <- peak_regions[-queryHits(overlapping_hits)]
preceding_hits <- precede(non_overlapping_regions, gene_granges)
preceding_genes <- gene_granges[preceding_hits]
preceding_gene_ids <- preceding_genes$gene_id
preceding_granges <- non_overlapping_regions %>%
mutate(gene_id = preceding_gene_ids,
dist = distance(non_overlapping_regions, preceding_genes)) %>%
mutate(name = paste0(name, "_", gene_id))
# Finally, we merge the overlapping and preceding granges
final_granges <- bind_ranges(overlapping_granges, preceding_granges) %>%
sortSeqlevels() %>% sort()
return(final_granges)
}