signac
signac copied to clipboard
MACS2 peaks beyond chromosome
Hi @timoast,
I am running into an issue I have seen on here, but cannot seem to identify a solution. I am currently trying to analyze an snATAC dataset from zebrafish. While running AddMotifs, I get the error:
> library(BSgenome.Drerio.UCSC.danRer11)
> posit.freq.matrices <- getMatrixSet(x = JASPAR2020,
+ opts = list(collection = "CORE",
+ tax_group = "vertebrates",
+ all_versions = F))
> seqlevelsStyle(BSgenome.Drerio.UCSC.danRer11) <- 'Ensembl'
> iom.atac <- AddMotifs(object = iom.atac, genome = BSgenome.Drerio.UCSC.danRer11, pfm = posit.freq.matrices)
Error in loadFUN(x, seqname, ranges) : trying to load regions beyond the boundaries of non-circular sequence "chr23"
Based on issue #860, it looks like this is because I have peaks that extend beyond the chromosome.
> gr <- granges(iom.atac[['peaks']])
> end(BSgenome.Drerio.UCSC.danRer11$`23`)
[1] 46223584 1
> max(end(gr[seqnames(gr) == "23"]))
[1] 46223630
Interestingly, I was able to run AddMotifs without error on the peaks from my CellRanger pre-processed data - this error only occurred after I re-did the peak-calling with macs2.
I know you mentioned in #860 that the issue may be that the version of the genome is different between the data and the BSgenome object, however both the BSgenome and data use the same assembly, GRCz11. Are there any other potential causes for this issue, or is there any way around it?
Thank you for this great tool and your support, Troy
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome_1.58.0 rtracklayer_1.50.0 Biostrings_2.58.0
[4] XVector_0.30.0 TFBSTools_1.28.0 JASPAR2020_0.99.10
[7] Signac_1.6.0 tidyr_1.2.0 readr_2.1.2
[10] magrittr_2.0.2 patchwork_1.1.1 cowplot_1.1.1
[13] MAST_1.16.0 SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
[16] Biobase_2.50.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
[19] IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.1
[22] MatrixGenerics_1.2.1 matrixStats_0.61.0 sctransform_0.3.3
[25] ggplot2_3.3.5 Matrix_1.3-4 dplyr_1.0.8
[28] SeuratObject_4.0.4 Seurat_4.1.0
loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.24 R.utils_2.11.0
[4] tidyselect_1.1.2 poweRlaw_0.70.6 RSQLite_2.2.10
[7] AnnotationDbi_1.52.0 htmlwidgets_1.5.4 grid_4.0.5
[10] docopt_0.7.1 BiocParallel_1.24.1 Rtsne_0.15
[13] munsell_0.5.0 codetools_0.2-18 ica_1.0-2
[16] future_1.24.0 miniUI_0.1.1.1 withr_2.5.0
[19] spatstat.random_2.1-0 colorspace_2.0-3 knitr_1.37
[22] rstudioapi_0.13 ROCR_1.0-11 tensor_1.5
[25] listenv_0.8.0 slam_0.1-50 GenomeInfoDbData_1.2.4
[28] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0
[31] parallelly_1.30.0 vctrs_0.3.8 generics_0.1.2
[34] xfun_0.30 lsa_0.73.2 ggseqlogo_0.1
[37] R6_2.5.1 bitops_1.0-7 spatstat.utils_2.3-0
[40] cachem_1.0.6 DelayedArray_0.16.3 assertthat_0.2.1
[43] promises_1.2.0.1 scales_1.1.1 gtable_0.3.0
[46] globals_0.14.0 goftest_1.2-3 seqLogo_1.56.0
[49] rlang_1.0.2 RcppRoll_0.3.0 splines_4.0.5
[52] lazyeval_0.2.2 spatstat.geom_2.3-2 reshape2_1.4.4
[55] abind_1.4-5 httpuv_1.6.5 tools_4.0.5
[58] ellipsis_0.3.2 spatstat.core_2.4-0 RColorBrewer_1.1-2
[61] ggridges_0.5.3 Rcpp_1.0.7 plyr_1.8.6
[64] zlibbioc_1.36.0 purrr_0.3.4 RCurl_1.98-1.6
[67] rpart_4.1-15 deldir_1.0-6 pbapply_1.5-0
[70] zoo_1.8-9 ggrepel_0.9.1 cluster_2.1.1
[73] motifmatchr_1.12.0 data.table_1.14.2 scattermore_0.8
[76] lmtest_0.9-39 RANN_2.6.1 SnowballC_0.7.0
[79] fitdistrplus_1.1-6 hms_1.1.1 mime_0.12
[82] xtable_1.8-4 XML_3.99-0.9 sparsesvd_0.2
[85] gridExtra_2.3 compiler_4.0.5 tibble_3.1.6
[88] KernSmooth_2.23-18 crayon_1.5.0 R.oo_1.24.0
[91] htmltools_0.5.2 mgcv_1.8-34 later_1.3.0
[94] tzdb_0.2.0 DBI_1.1.2 tweenr_1.0.2
[97] MASS_7.3-53.1 cli_3.2.0 R.methodsS3_1.8.1
[100] igraph_1.2.11 pkgconfig_2.0.3 GenomicAlignments_1.26.0
[103] TFMPvalue_0.0.8 plotly_4.10.0 spatstat.sparse_2.1-0
[106] annotate_1.68.0 DirichletMultinomial_1.32.0 stringr_1.4.0
[109] digest_0.6.29 RcppAnnoy_0.0.19 pracma_2.3.8
[112] CNEr_1.26.0 spatstat.data_2.1-2 leiden_0.3.9
[115] fastmatch_1.1-3 uwot_0.1.11 shiny_1.7.1
[118] Rsamtools_2.6.0 gtools_3.9.2 lifecycle_1.0.1
[121] nlme_3.1-152 jsonlite_1.8.0 viridisLite_0.4.0
[124] fansi_1.0.2 pillar_1.7.0 lattice_0.20-41
[127] KEGGREST_1.30.1 fastmap_1.1.0 httr_1.4.2
[130] survival_3.2-10 GO.db_3.12.1 glue_1.6.2
[133] qlcMatrix_0.9.7 png_0.1-7 bit_4.0.4
[136] ggforce_0.3.3 stringi_1.7.6 blob_1.2.2
[139] caTools_1.18.2 memoise_2.0.1 irlba_2.3.5
[142] future.apply_1.8.1
In looking at the CellRanger peak calling, it looks like there is a peak called at the very end of chr23. Is it possible that the CallPeaks function might have extended this peak with the default ext.size = 200 such that it would extend beyond the boundary of the chromosome? If so, is there any way to remove this peak (and any others extending beyond their chromosome) so it does not interfere with downstream analysis?
So macs2 did in fact extend a peak at the end of the chromosome beyond the chromosome, and removing that single peak by taking the CallPeaks() output and setting that row to NULL worked to resolve the issue. This might be an enhancement suggestion, in case there is any way to include a limitation in the CallPeaks() function to clip any peaks that would extend beyond the annotated chromosome.
@t-and thanks for looking into it, I agree it would be good to have an automatic way of preventing this but it's difficult to do since we don't currently require any information about the chromosome lengths when running CallPeaks()
. We could add an optional parameter to supply this in the future and clip peaks that fall outside chromosome boundaries.
The optional parameter would definitely be helpful, since I'd think most users would likely have some sort of BSgenome or other genome attached to use as a reference for where the chromosome should end. Thanks for being so attentive to the issues posted on here! If it's okay, I am going to close this issue now that the actual error is resolved.
I'll just leave this open until we have something added to Signac to handle this properly