rGREAT
rGREAT copied to clipboard
[Request] Support for genomes beyond GenomeInfoDb's seqlevelsStyle
Thank you for this helpful package. I noticed that the package has a fairly strong integration with GenomeInfoDb
.
(From the perspective of Bioconductor data structure integration that's great.)
But, the limited number of supported seqlevelsStyle/genomeStyles
in GenomeInfoDb
for "only" human, mouse, aso, makes it difficult to use the rGREAT
package for "exotic" genomes (such as bacteria).
Even when manually preparing the arguments gr, gene_sets, extended_tss, background
as genomic ranges, the "local" great
function is currently hard coded to query for the seqlevelsStyle
of the input, eg in the line
seqlevelsStyle(extended_tss) = seqlevelsStyle(gr)[1]
Which of course fails for my bacterial genome input
Error in seqlevelsStyle(seqlevels) :
The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style
Info: The seqlevels are needed to query for the genome length to build the larger background.
...
sl = seqlengths(extended_tss)
if(any(is.na(sl))) {
stop_wrap("`extended_tss` must have `seqlengths` been set.")
}
gr_genome = GRanges(seqnames = names(sl), ranges = IRanges(1, sl))
...
My current solution is to copy/paste/adapt the function code of great
to avoid the seqlevels
and manually provide the genome lengths. However, my code is a bit ad hoc and breaks of course the seqlevelsStyle support.
Thus, my request: Would it be possible to refactor the function great
to allow for custom genomes?
For instance, the math logic of the function (starting at the foreach) could be a separate function.
Since the various if statements in the first half of the function prepare the data structure for the main computation, a separate function might allow for custom wrappers.
For completeness, here is my current workaround.
My approach was to remove most of the if statements in the first part of the function to fit my purpose. Then I added a new argument that provides custom genome length values for the sl
variable, instead of the sl = seqlengths(extended_tss)
call.
https://gist.github.com/asgeissler/124e89d9fb9e2ae9ae30597e0048dc62
Hi, sorry for the very late reply. Another user inquiry leads me here :)
The design of the great()
function is to let it be general to any organism with any format of the genome. Maybe I missed something. I will have a check and reply to you soon.
For the behaviour of seqlevelsStyle()
, my original understanding is that if the genome is not supported, it does nothing. Maybe the default behaviour has been changed in newer bioc versions to throw an error? I will also check that.
I have fixed the bug of the error from seqlevelsStyle()
.