Biostrings Size of DNAStringSet object not compatible with `nchar()`

Hi all,

For a long time, I've noticed that {Biostrings} is very weird in terms of memory usage. For instance, when one subsets a DNAStringSet object to create a much smaller DNAStringSet object, object sizes do not change much (if at all). Here is a demonstration, comparing a DNAStringSet object with an entire plant genome to another (much smaller) DNAStringSet object with promoter sequences only:

suppressPackageStartupMessages({
    library(Biostrings)
    library(GenomicRanges)
    library(lobstr)
})
options(timeout = 1e4)

# Load genome and annotation 
genome <- readDNAStringSet(
    "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-61/plants/fasta/hordeum_vulgare/dna/Hordeum_vulgare.MorexV3_pseudomolecules_assembly.dna_rm.toplevel.fa.gz"
)
names(genome) <- gsub(" .*", "", names(genome))

annot <- rtracklayer::import(
    "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-61/plants/gff3/hordeum_vulgare/Hordeum_vulgare.MorexV3_pseudomolecules_assembly.61.gff3.gz"
)

# Keep genes only and add seqlengths
annot <- annot[annot$type == "gene"]
sl <- setNames(width(genome), names(genome))
sl <- sl[seqlevels(annot)]
seqlengths(annot) <- sl

# Extract promoter sequences
prom_ranges <- trim(promoters(annot, 1000, 200))
prom_seqs <- BSgenome::getSeq(genome, prom_ranges)
names(prom_seqs) <- prom_ranges$gene_id

# Select promoters of random genes
sel_genes <- sample(names(prom_seqs), 1e4, replace = FALSE)
prom_seqs_subset <- prom_seqs[sel_genes]

# Inspect objects
size_df <- data.frame(
    name = c("genome", "promoters", "promoters_subset"),
    size = c(
        obj_size(genome), 
        obj_size(prom_seqs), 
        obj_size(prom_seqs_subset)
    ),
    Mbp = c(
        sum(nchar(genome)) / 1e6, 
        sum(nchar(prom_seqs)) / 1e6, 
        sum(nchar(prom_seqs_subset)) / 1e6
    )
)

size_df
#>               name    size        Mbp
#> 1           genome 4.23 GB 4225.57752
#> 2        promoters 4.25 GB   42.98556
#> 3 promoters_subset 4.25 GB   11.99864

# Does removing `genome` do anything?
rm(genome)

size_df2 <- data.frame(
    name = c("promoters", "promoters_subset"),
    size = c(
        obj_size(prom_seqs), 
        obj_size(prom_seqs_subset)
    ),
    Mbp = c(
        sum(nchar(prom_seqs)) / 1e6, 
        sum(nchar(prom_seqs_subset)) / 1e6
    )
)

size_df2
#>               name    size      Mbp
#> 1        promoters 4.25 GB 42.98556
#> 2 promoters_subset 4.25 GB 11.99864


sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Brussels
#>  date     2025-08-07
#>  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  abind                  1.4-5     2016-07-21 [1] CRAN (R 4.4.1)
#>  Biobase                2.64.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  BiocGenerics         * 0.50.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  BiocIO                 1.14.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  BiocParallel           1.38.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  Biostrings           * 2.72.1    2024-06-02 [1] Bioconductor 3.19 (R 4.4.1)
#>  bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.4.1)
#>  BSgenome               1.72.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  cli                    3.6.3     2024-06-21 [1] CRAN (R 4.4.1)
#>  codetools              0.2-20    2024-03-31 [1] CRAN (R 4.4.1)
#>  crayon                 1.5.3     2024-06-20 [1] CRAN (R 4.4.1)
#>  curl                   5.2.1     2024-03-01 [1] CRAN (R 4.4.1)
#>  DelayedArray           0.30.1    2024-05-07 [1] Bioconductor 3.19 (R 4.4.1)
#>  digest                 0.6.36    2024-06-23 [1] CRAN (R 4.4.1)
#>  evaluate               0.24.0    2024-06-10 [1] CRAN (R 4.4.1)
#>  fastmap                1.2.0     2024-05-15 [1] CRAN (R 4.4.1)
#>  fs                     1.6.4     2024-04-25 [1] CRAN (R 4.4.1)
#>  GenomeInfoDb         * 1.40.1    2024-05-24 [1] Bioconductor 3.19 (R 4.4.1)
#>  GenomeInfoDbData       1.2.12    2024-07-24 [1] Bioconductor
#>  GenomicAlignments      1.40.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  GenomicRanges        * 1.56.1    2024-06-12 [1] Bioconductor 3.19 (R 4.4.1)
#>  glue                   1.8.0     2024-09-30 [1] https://cran.r-universe.dev (R 4.4.1)
#>  htmltools              0.5.8.1   2024-04-04 [1] CRAN (R 4.4.1)
#>  httr                   1.4.7     2023-08-15 [1] CRAN (R 4.4.1)
#>  IRanges              * 2.38.1    2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
#>  jsonlite               1.8.8     2023-12-04 [1] CRAN (R 4.4.1)
#>  knitr                  1.48      2024-07-07 [1] CRAN (R 4.4.1)
#>  lattice                0.22-6    2024-03-20 [1] CRAN (R 4.4.1)
#>  lifecycle              1.0.4     2023-11-07 [1] CRAN (R 4.4.1)
#>  lobstr               * 1.1.2     2022-06-22 [1] CRAN (R 4.4.1)
#>  Matrix                 1.7-0     2024-04-26 [1] CRAN (R 4.4.1)
#>  MatrixGenerics         1.16.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  matrixStats            1.3.0     2024-04-11 [1] CRAN (R 4.4.1)
#>  prettyunits            1.2.0     2023-09-24 [1] CRAN (R 4.4.1)
#>  R6                     2.5.1     2021-08-19 [1] CRAN (R 4.4.1)
#>  RCurl                  1.98-1.16 2024-07-11 [1] CRAN (R 4.4.1)
#>  reprex                 2.1.1     2024-07-06 [1] CRAN (R 4.4.1)
#>  restfulr               0.0.15    2022-06-16 [1] CRAN (R 4.4.1)
#>  rjson                  0.2.21    2022-01-09 [1] CRAN (R 4.4.1)
#>  rlang                  1.1.4     2024-06-04 [1] CRAN (R 4.4.1)
#>  rmarkdown              2.27      2024-05-17 [1] CRAN (R 4.4.1)
#>  Rsamtools              2.20.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  rstudioapi             0.16.0    2024-03-24 [1] CRAN (R 4.4.1)
#>  rtracklayer            1.64.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  S4Arrays               1.4.1     2024-05-20 [1] Bioconductor 3.19 (R 4.4.1)
#>  S4Vectors            * 0.42.1    2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
#>  sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.4.1)
#>  SparseArray            1.4.8     2024-05-24 [1] Bioconductor 3.19 (R 4.4.1)
#>  SummarizedExperiment   1.34.0    2024-05-01 [1] Bioconductor 3.19 (R 4.4.1)
#>  UCSC.utils             1.0.0     2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  withr                  3.0.0     2024-01-16 [1] CRAN (R 4.4.1)
#>  xfun                   0.51      2025-02-19 [1] CRAN (R 4.4.1)
#>  XML                    3.99-0.17 2024-06-25 [1] CRAN (R 4.4.1)
#>  XVector              * 0.44.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#>  yaml                   2.3.9     2024-07-05 [1] CRAN (R 4.4.1)
#>  zlibbioc               1.50.0    2024-04-30 [1] Bioconductor 3.19 (R 4.4.1)
#> 
#>  [1] /home/faalm/R/x86_64-pc-linux-gnu-library/4.4
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

^{Created on 2025-08-07 with reprex v2.1.1}

Note that the number of sequences is dramatically different, while object sizes remain pretty much the same. Is that something you are aware of? Why does this happen?

As a side comment, I've also noticed that, for small sets of sequences, it's more efficient to store gzipped FASTA files compared to binary .rds files with XStringSet objects. This is not true for large XStringSet objects, though. It seems like subsetting a large XStringSet object preserves something of the original object in the subset. Does that make sense?

All the best, Fabricio

Aug 07 '25 08:08 almeidasilvaf

Hi,

The reason for what you observed is that BSgenome::getSeq() tries to avoid copying the sequence data from the original object when it can. More precisely, when a promoter sequence is a subsequence of a sequence in genome (which happens for any promoter sequence that is located on the plus strand), then BSgenome::getSeq() simply creates a "window" on the chromosome sequence. It doesn't actually copy any sequence data.

This means that the final object (prom_seqs) shares memory with the original object (genome). Without going into too many technical details, this sharing is achieved by storing external pointers to memory blocks that contain the original sequence data. You can see the addresses and sizes of these memory blocks (formally called SharedRaw objects) with:

genome@pool
# SharedRaw_Pool of length 8
# 1: SharedRaw of length 516505932 (data starting at address 0x7fccad36b040)
# 2: SharedRaw of length 665585731 (data starting at address 0x7fcc858aa040)
# 3: SharedRaw of length 621516506 (data starting at address 0x7fcc607f0040)
# 4: SharedRaw of length 610333535 (data starting at address 0x7fcc3c1e0040)
# 5: SharedRaw of length 588218686 (data starting at address 0x7fcc190e7040)
# 6: SharedRaw of length 561794515 (data starting at address 0x7fcbf7922040)
# 7: SharedRaw of length 632540561 (data starting at address 0x7fcbd1de5040)
# 8: SharedRaw of length 29082053 (data starting at address 0x7fccd0650040)

prom_seqs@pool
# SharedRaw_Pool of length 16
# 1: SharedRaw of length 516505932 (data starting at address 0x7fccad36b040)
# 2: SharedRaw of length 665585731 (data starting at address 0x7fcc858aa040)
# 3: SharedRaw of length 621516506 (data starting at address 0x7fcc607f0040)
# 4: SharedRaw of length 610333535 (data starting at address 0x7fcc3c1e0040)
# 5: SharedRaw of length 588218686 (data starting at address 0x7fcc190e7040)
# 6: SharedRaw of length 561794515 (data starting at address 0x7fcbf7922040)
# 7: SharedRaw of length 632540561 (data starting at address 0x7fcbd1de5040)
# 8: SharedRaw of length 29082053 (data starting at address 0x7fccd0650040)
# 9: SharedRaw of length 2572886 (data starting at address 0x5782656868c0)
# 10: SharedRaw of length 3462294 (data starting at address 0x5782659f36e0)
# 11: SharedRaw of length 3197280 (data starting at address 0x578265d62420)
# 12: SharedRaw of length 2387141 (data starting at address 0x57826d8d4df0)
# 13: SharedRaw of length 3238969 (data starting at address 0x578269655e80)
# 14: SharedRaw of length 2545713 (data starting at address 0x57826a3ca1e0)
# 15: SharedRaw of length 3362235 (data starting at address 0x57826b1417e0)
# 16: SharedRaw of length 455242 (data starting at address 0x578250c31e60)

As you can see, the first 8 memory blocks are shared between genome and prom_seqs. The additional memory blocks you see in prom_seqs (blocks 9 to 16) were created by getSeq() to store the promoter sequences coming from the minus strand. In this case, using a window on the original memory block doesn't work because the promoter sequence is the reverse complement of the subsequence defined by some window on the original memory block.

Even if prom_seqs only contains small windows on the original memory blocks, this confuses tools like utils::object.size() or lobstr::obj_size(). These tools report the total size of all the shared memory blocks (plus some other bookkeeping data contained in the DNAStringSet object), hence why they report almost the same thing for genome and prom_seqs. They actually report something slightly bigger for the latter (4.25 GB instead of 4.23 GB) because of the additional memory blocks in prom_seqs.

rm(genome) doesn't help because even though it removes the genome object, it doesn't remove the 8 (big) original shared memory blocks which are used by prom_seqs.

To "detach" prom_seqs from the original memory blocks, you can use compact():

prom_seqs <- compact(prom_seqs)

This copies only the necessary data from the (big) original memory blocks and put them in new memory blocks that have a much smaller total size:

prom_seqs@pool
# SharedRaw_Pool of length 16
# 1: SharedRaw of length 2737363 (data starting at address 0x57824e3ac5c0)
# 2: SharedRaw of length 3489068 (data starting at address 0x57826b601e80)
# 3: SharedRaw of length 3262538 (data starting at address 0x57826794a4d0)
# 4: SharedRaw of length 2504478 (data starting at address 0x57825bba1610)
# 5: SharedRaw of length 3407007 (data starting at address 0x57826eb817a0)
# 6: SharedRaw of length 2606529 (data starting at address 0x57826c19bfe0)
# 7: SharedRaw of length 3221208 (data starting at address 0x57824f9ff080)
# 8: SharedRaw of length 401575 (data starting at address 0x57825fafeaa0)
# 9: SharedRaw of length 2572886 (data starting at address 0x57826d138c90)
# 10: SharedRaw of length 3462294 (data starting at address 0x578254bde440)
# 11: SharedRaw of length 3197280 (data starting at address 0x578253f91640)
# 12: SharedRaw of length 2387141 (data starting at address 0x57825c8fe200)
# 13: SharedRaw of length 3238969 (data starting at address 0x57826875bd60)
# 14: SharedRaw of length 2545713 (data starting at address 0x57826bf2e770)
# 15: SharedRaw of length 3362235 (data starting at address 0x578262f15b40)
# 16: SharedRaw of length 455242 (data starting at address 0x578257877e00)

obj_size(prom_seqs)
# 43.29 MB

We went from 4.23 GB to 43.29 MB 😃

Note that the 8 (big) original shared memory blocks are no longer used by anything therefore they are now candidates for garbage collection. In other words, the next call to gc() should free about 4.23 GB of Vcells.

Unfortunately, if you serialize prom_seqs before calling compact(), you end up serializing the 8 (big) original memory blocks, which is not good 😉 So before serializing a DNAStringSet object that was obtained by extracting sequences from a much bigger DNAStringSet object, make sure to first pass it thru compact().

Hope this helps, H.

Nov 05 '25 06:11 hpages

Hi @hpages

Thank you for this very detailed and clear explanation!

I had never heard of compact() before. Maybe you could add a small note about it on Biostrings' vignette? I guess other users face the same struggle when working with large XStringSet objects (entire genomes, for instance).

My current workaround is to export the XStringSet object with the subset to a FASTA file, and then read it. This frees up memory as expected, so I guess it's equivalent to calling compact() right after creating the subset.

Now, from a software design point of view: are there performance advantages of using the current implementation (with memory sharing)? AFAIK, no other core BioC S4 class uses this approach, am I right? Instead of relying on users calling compact() every time, why not make this the default or add an extra parameter (say, clean = FALSE) that users can set to TRUE? The latter would force maintainers (you & co) to document this behavior in the function's man page, so I believe it's better than a note in the vignette.

All the best, Fabricio

Nov 05 '25 08:11 almeidasilvaf

The avoid-copying-sequence-data-if-possible approach in Biostrings was decided and implemented a long time ago (about 18 years ago or so). In most situations, it helps make things extremely fast and memory-efficient. In other situations like when you have a DNAStringSet object that has windows defined on a much bigger object, and you want to serialize it or claim back the memory used by the bigger object, it can be problematic. It's a double-edged sword approach.

I like your suggestion to add an extra argument to getSeq() (e.g. data.copy or something like that). I just created an issue for it: issue #131.

I agree that compact() should be advertized more (but in hindsight I don't really like the name so would need to find a better one).

Thanks for your feedback.

H.

Nov 05 '25 17:11 hpages