Biostrings
Misleading show() method for XStringSet objects
This is a follow-up to https://support.bioconductor.org/p/122340/#122400
The show() method for XStringSet objects currently suggests the existence of a seq() getter for these objects:
library(Biostrings)
library(drosophila2probe)
dna <- DNAStringSet(drosophila2probe)
dna
# A DNAStringSet instance of length 265400
# width seq
# [1] 25 CCTGAATCCTGGCAATGTCATCATC
# [2] 25 ATCCTGGCAATGTCATCATCAATGG
# [3] 25 ATCAGTTGTCAACGGCTAATACGCG
# [4] 25 ATCAATGGCGATTGCCGCGTCTGCA
# [5] 25 CCGCGTCTGCAATGTGAGGGCCTAA
# ... ... ...
# [265396] 25 TACTACTTGAGCCACAACCATCTGA
# [265397] 25 AGGGACTAAAGAGGCCCCATGCTCT
# [265398] 25 CATGCTCTGTCTGGTGTCAGCGCTA
# [265399] 25 GTCAGCGCTACATGGTCCAGGACAA
# [265400] 25 CCAGGACAAGTATGGACTTCCCCAC
but there is no such getter.
Same issue with the show() method for XString objects:
dna[[1]]
# 25-letter "DNAString" instance
# seq: CCTGAATCCTGGCAATGTCATCATC
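For readers who land here looking for the getter that the "seq" label hints at: a minimal sketch, assuming the usual Biostrings accessors are what you want. as.character() and width() return the two displayed columns. The two short sequences are copied from the output above so that the example does not depend on drosophila2probe.

```r
library(Biostrings)

# Small stand-in for the drosophila2probe data used above.
dna <- DNAStringSet(c("CCTGAATCCTGGCAATGTCATCATC",
                      "ATCCTGGCAATGTCATCATCAATGG"))

# The "seq" column: the sequences as a plain character vector.
seqs <- as.character(dna)

# The "width" column: the number of letters in each sequence.
widths <- width(dna)

# For a single XString element, as.character() gives a length-1 string.
first <- as.character(dna[[1]])
```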
Also it would be good to make these show() methods more consistent with other show() methods in S4Vectors/IRanges/GenomicRanges:
library(IRanges)
IRanges(1:3, 10, names=LETTERS[1:3], score=runif(3))
# IRanges object with 3 ranges and 1 metadata column:
# start end width | score
# <integer> <integer> <integer> | <numeric>
# A 1 10 10 | 0.267148569226265
# B 2 10 9 | 0.106218574102968
# C 3 10 8 | 0.649568639695644
In particular the names on a DNAStringSet object should be displayed on the left. Also its metadata columns should be displayed (right now they are not):
dna2 <- dna[1:3]
names(dna2) <- LETTERS[1:3]
mcols(dna2)$score <- runif(3)
dna2
# A DNAStringSet instance of length 3
# width seq names
# [1] 25 CCTGAATCCTGGCAATGTCATCATC A
# [2] 25 ATCCTGGCAATGTCATCATCAATGG B
# [3] 25 ATCAGTTGTCAACGGCTAATACGCG C
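To make the request concrete, here is a rough sketch of the kind of layout being asked for, with names on the left and metadata columns on the right. It only assembles an ordinary data.frame for illustration and is not a proposal for how the show() method should be implemented. The score values are made up so that the result is reproducible.

```r
library(Biostrings)

# Same three sequences as above, with names supplied directly.
dna2 <- DNAStringSet(c(A = "CCTGAATCCTGGCAATGTCATCATC",
                       B = "ATCCTGGCAATGTCATCATCAATGG",
                       C = "ATCAGTTGTCAACGGCTAATACGCG"))
mcols(dna2)$score <- c(0.1, 0.2, 0.3)  # illustrative values only

# Names become row names (left), followed by width, seq,
# and then whatever metadata columns are present.
tab <- data.frame(width = width(dna2),
                  seq = as.character(dna2),
                  as.data.frame(mcols(dna2)),
                  row.names = names(dna2))
print(tab)
```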
Somewhat related is the initial value returned by mcols():
> mcols(DNAStringSet())
NULL
> mcols(GRanges())
DataFrame with 0 rows and 0 columns
This has not much to do with the show() method but with the fact that mcols() is allowed to be NULL for some Vector derivatives like Hits, Rle, IRanges, DNAStringSet, etc. For other Vector derivatives like GRanges, GRangesList, SummarizedExperiment, etc., mcols() is forced to be a DataFrame. This is an inconsistency that we should discuss in a separate issue if we think it should be addressed.
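Until that is settled, code that has to handle both kinds of objects can normalize the result itself. A minimal sketch, assuming a reasonably recent S4Vectors that exports make_zero_col_DFrame(); mcols_or_empty() is a hypothetical helper name, not an existing function in any of these packages.

```r
library(Biostrings)  # also attaches S4Vectors and IRanges

# Hypothetical helper: always return a DataFrame, whether mcols(x)
# is NULL or already a DataFrame.
mcols_or_empty <- function(x) {
  mc <- mcols(x)
  if (is.null(mc)) {
    # Zero-column DataFrame with one row per element of x.
    mc <- S4Vectors::make_zero_col_DFrame(length(x))
  }
  mc
}

mc <- mcols_or_empty(DNAStringSet(c("ACGT", "TTGA")))
```

With this, downstream code can rely on nrow(mc) == length(x) regardless of the class of x.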
There are also some other inconsistencies in how element names are shown. Long names seem to be treated differently, probably for historical reasons based on the positioning of the names (left vs. right).
library(Biostrings)
library(GenomicRanges)
seq <- RNAStringSet(c("UAUCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUAAGCCAUGCAUGUCUAAGUAUAAGCAAUUUAUACAGUGAAACUGCGAAUGGCUCA",
"CCGAGAGGUCUUGGUAAUCUUGUGAAACUCCGUCGUGCUGGGGAUAGAGCAUUGUAAUUAUUGCUCUUCAACGAGGAAUUCCUAGUAAGCGCAAGUCAUCA"))
names(seq) <- c("TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter",
"TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter")
gr <- GRanges(c("chr1:5-10:+","chr1:6-10:+"))
names(gr) <- names(seq)
seq
#> A RNAStringSet instance of length 2
#> width seq names
#> [1] 100 UAUCUGGUUGAUCCUGCCAGU...GUGAAACUGCGAAUGGCUCA TheFirstVeryLongN...
#> [2] 101 CCGAGAGGUCUUGGUAAUCUU...CUAGUAAGCGCAAGUCAUCA TheSecondVeryLong...
gr
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames
#> <Rle>
#> TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter chr1
#> TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter chr1
#> ranges
#> <IRanges>
#> TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter 5-10
#> TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter 6-10
#> strand
#> <Rle>
#> TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter +
#> TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter +
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Right, long names are truncated. But maybe that's a good thing and we should keep that when we move them to the left. I don't know.
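If the truncation is kept when names move to the left, the rule could look something like the sketch below. truncate_names() is a hypothetical helper written in base R, and the 20-character limit is only inferred from the RNAStringSet output above; it is not taken from the Biostrings source.

```r
# Hypothetical helper: shorten names longer than max_width and mark
# the cut with "...", so the result is exactly max_width characters.
truncate_names <- function(x, max_width = 20L) {
  too_long <- nchar(x) > max_width
  x[too_long] <- paste0(substr(x[too_long], 1L, max_width - 3L), "...")
  x
}

truncate_names(c("TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter",
                 "TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter",
                 "A"))
# [1] "TheFirstVeryLongN..." "TheSecondVeryLong..." "A"
```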
Yeah, these things predate GRanges. The show() methods for XStringSet, XStringViews, and XString objects are actually my first show() methods ever. I implemented them more than 13 years ago when I took over the refactoring and maintenance of Biostrings. At that time we didn't have any of the IRanges, GenomicRanges, or S4Vectors packages yet.