ArchR icon indicating copy to clipboard operation
ArchR copied to clipboard

Issue in .getFragsFromArrow (and thus addGroupCoverages) when nFrags = 1

Open dagarfield opened this issue 3 years ago • 3 comments

I ran into a small issue when running addGroupCoverages on my terrible non-model species genome.

ArchR-addGroupCoverages-765f62f12d5f-Date-2021-04-18_Time-23-35-59.log

addGroupCoverages(ArchRProj = projSpurp_doubletRemoved_filtered2, groupBy = "Clusters_doubletRemoved_filtered", threads = 1, force = TRUE) #just the defaults

Gives us:

Error in .[, 1]: incorrect number of dimensions Traceback:

  1. addGroupCoverages(ArchRProj = projSpurp_doubletRemoved_filtered2, . groupBy = "Clusters_doubletRemoved_filtered", threads = 1, . force = TRUE)
  2. .batchlapply(args)
  3. do.call(.safelapply, args)
  4. (function (..., threads = 1, preschedule = FALSE) . { . if (tolower(.Platform$OS.type) == "windows") { . threads <- 1 . } .......(etc..lots of text and cell names)
  5. lapply(...)
  6. FUN(X[[i]], ...)
  7. .getFragsFromArrow(ArrowFiles[j], chr = availableChr[k], out = "GRanges", . cellNames = cellGroupi)
  8. .h5read(ArrowFile, paste0("Fragments/", chr, "/Ranges"), method = method) %>% . { . IRanges(start = .[, 1], width = .[, 2]) . }
  9. IRanges(start = .[, 1], width = .[, 2])

The issue seems to be not with addGroupCoverage per se, but rather with the helper function .getFragsFromArrow, which seems to be choking when it encounters a chromosome where

ArchR:::.h5read(inArrow2, paste0(paste0("Fragments/",myChr), "/Ranges"))

Contains only one fragment (which happened in this case, bad luck, after filtering out cells)

It looks like .getFragsFromArrow has a catch for cases of no fragments, namely

nFrags <- sum(.h5read(ArrowFile, paste0("Fragments/", chr, "/RGLengths"), method = method)) if (nFrags == 0) {....

https://github.com/GreenleafLab/ArchR/blob/968e4421ce7187a8ac7ea1cf6077412126876d5f/R/ArrowRead.R#L172

But it goes funny with only one fragment here:

if (is.null(cellNames) | tolower(method) == "fast") { output <- .h5read(ArrowFile, paste0("Fragments/", chr, "/Ranges"), method = method) %>% { IRanges(start = .[, 1], width = .[, 2]) }

https://github.com/GreenleafLab/ArchR/blob/968e4421ce7187a8ac7ea1cf6077412126876d5f/R/ArrowRead.R#L184

Presumably the fix is to set the catch as nFrags <2 rather than ==0 (?).

I promise that in the next bug I will actually figure out how to use permalinks.....

dagarfield avatar Apr 18 '21 21:04 dagarfield

Note: Changing if (nFrags == 0) to if (nFrags < 2) in .getFragsFromArrow does indeed seem to resolve the issue. Please let me know if this was an unintentionally bad idea. Otherwise, close and push in next release. Thanks!

(though it leads to some funny issues later when writing out BED files as .getCoverageInsertionSites leads an empty integer(0) when it tries to process an empty coverage file -- for BED file conversion, I dealt with this by editing .writeCoverageToBed to only write a line if .getCoverageInsertionSites returns something real)

dagarfield avatar Apr 19 '21 04:04 dagarfield

I added permalinks to the places I think you are pointing to. Leaving this issue for @jgranja24 to resolve

rcorces avatar Apr 19 '21 17:04 rcorces

Great, thanks!

My modified code does what I need locally, but the error is part of a general theme -- for non-model species with small scaffolds (or small chromosomes):

  1. There will be some chromosomes without coverage (y'all already fixed this, I think)
  2. There will be some cases where small chromosomes have coverage, but no reads near a gene (or no genes on the scaffold) leading to issues in calculating gene score (I think also fixed, though I filter out all scaffolds without genes, so not sure what happens there)
  3. If you subset cells (e.g. after filtering) you can once again end up with (1) and (2). This leads to issues in saving subsetted arrow files (which in turn leads to issues in marker gene calculations)
  4. In generating coverage files, you can get weird results if you have fewer than 2 fragments (coverage files are fun because even when all of the above is dealt with, you can get coverage issues when you cluster because now you have only a subset of cells)
  5. When converting coverage files to BED files (a step in the peak calling process), you end up with bed entries for no coverage scaffolds/chromosomes that have a chromosome name (column 1) but no start or end coordinates (which MACS2 does not like). I locally modified .writeCoverageToBed to deal with this issue.

My guess is that aside from the challenges of creating the initial genome object, this is the theme that will hit most people using organisms with incompletely assembled genomes -- some scaffolds have no reads (or no reads near genes) and some scaffolds have no genes.

If you decide to tackle this, let me know if I can help via example files, etc.

dagarfield avatar Apr 20 '21 03:04 dagarfield