ArchR
ArchR copied to clipboard
Mitochondrial genes removed at import10xFeatureMatrix
Hi Jeffrey and Ryan,
I was testing the multiome pipeline following the tutorial below. https://greenleaflab.github.io/ArchR_2020/Ex-Analyze-Multiome.html
However, I found out that at the import10xFeatureMatrix function, all the mitochondrial genes were removed.
seRNA <- import10xFeatureMatrix(input = input_files, names = input_names)
Importing Feature Matrix 1 of 1
sum(str_detect(rownames(seRNA),'MT-'))
[1] 0
Is this the default behavior? I have confirmed that they exist in the cellranger output like below.
Thanks!
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
[5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods
[10] base
other attached packages:
[1] openxlsx_4.2.3 ggrepel_0.9.1
[3] circlize_0.4.12 ComplexHeatmap_2.6.2
[5] gtable_0.3.0 harmony_1.0
[7] Rcpp_1.0.6 uwot_0.1.10
[9] gridExtra_2.3 nabor_0.5.0
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.0 SeuratDisk_0.0.0.9013
[13] qusage_2.24.0 limma_3.46.0
[15] stringr_1.4.0 BSgenome.Hsapiens.UCSC.hg38_1.4.3
[17] BSgenome_1.58.0 rtracklayer_1.50.0
[19] Biostrings_2.58.0 XVector_0.30.0
[21] Seurat_3.2.3 ArchR_1.0.1
[23] magrittr_2.0.1 rhdf5_2.34.0
[25] Matrix_1.2-18 data.table_1.13.6
[27] SummarizedExperiment_1.20.0 Biobase_2.50.0
[29] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[31] IRanges_2.24.1 S4Vectors_0.28.1
[33] BiocGenerics_0.36.0 MatrixGenerics_1.2.0
[35] matrixStats_0.57.0 ggplot2_3.3.3
loaded via a namespace (and not attached):
[1] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2
[4] splines_4.0.3 BiocParallel_1.24.1 listenv_0.8.0
[7] scattermore_0.7 digest_0.6.27 htmltools_0.5.1.1
[10] fansi_0.4.2 tensor_1.5 cluster_2.1.0
[13] ROCR_1.0-11 globals_0.14.0 colorspace_2.0-0
[16] xfun_0.20 dplyr_1.0.3 crayon_1.3.4
[19] RCurl_1.98-1.2 jsonlite_1.7.2 spatstat_1.64-1
[22] spatstat.data_1.7-0 survival_3.2-7 zoo_1.8-8
[25] glue_1.4.2 polyclip_1.10-0 emmeans_1.5.3
[28] zlibbioc_1.36.0 leiden_0.3.6 GetoptLong_1.0.5
[31] DelayedArray_0.16.1 Rhdf5lib_1.12.0 future.apply_1.7.0
[34] shape_1.4.5 abind_1.4-5 scales_1.1.1
[37] mvtnorm_1.1-1 DBI_1.1.1 miniUI_0.1.1.1
[40] viridisLite_0.3.0 xtable_1.8-4 clue_0.3-58
[43] reticulate_1.18 bit_4.0.4 rsvd_1.0.3
[46] htmlwidgets_1.5.3 httr_1.4.2 RColorBrewer_1.1-2
[49] ellipsis_0.3.1 ica_1.0-2 pkgconfig_2.0.3
[52] XML_3.99-0.5 deldir_0.2-9 tidyselect_1.1.0
[55] rlang_0.4.10 reshape2_1.4.4 later_1.1.0.1
[58] munsell_0.5.0 tools_4.0.3 cli_2.2.0
[61] generics_0.1.0 ggridges_0.5.3 evaluate_0.14
[64] fastmap_1.0.1 yaml_2.2.1 goftest_1.2-2
[67] bit64_4.0.5 knitr_1.30 fitdistrplus_1.1-3
[70] zip_2.1.1 purrr_0.3.4 RANN_2.6.1
[73] pbapply_1.4-3 future_1.21.0 nlme_3.1-150
[76] mime_0.9 hdf5r_1.3.3 compiler_4.0.3
[79] rstudioapi_0.13 plotly_4.9.3 png_0.1-7
[82] spatstat.utils_2.0-0 tibble_3.0.5 stringi_1.5.3
[85] lattice_0.20-41 fftw_1.0-6 vctrs_0.3.6
[88] pillar_1.4.7 lifecycle_0.2.0 rhdf5filters_1.2.0
[91] lmtest_0.9-38 GlobalOptions_0.1.2 estimability_1.3
[94] RcppAnnoy_0.0.18 cowplot_1.1.1 bitops_1.0-6
[97] irlba_2.3.3 httpuv_1.5.5 patchwork_1.1.1
[100] R6_2.5.0 promises_1.1.1 KernSmooth_2.23-18
[103] parallelly_1.23.0 codetools_0.2-18 MASS_7.3-53
[106] assertthat_0.2.1 rjson_0.2.20 withr_2.4.0
[109] GenomicAlignments_1.26.0 sctransform_0.3.2 Rsamtools_2.6.0
[112] GenomeInfoDbData_1.2.4 mgcv_1.8-33 rpart_4.1-15
[115] tidyr_1.1.2 rmarkdown_2.6 Cairo_1.5-12.2
[118] Rtsne_0.15 shiny_1.5.0 tinytex_0.29
This is the expected behavior and I'm not sure how to fix this. This happens at this line of code: https://github.com/GreenleafLab/ArchR/blob/2f022a448d8248a0f9afb33419bcbaeafe7731c0/R/MultiModal.R#L89
The feature matrix in the hdf5 file (accessed via features <- h5read(featureMatrix, "/matrix/features")
) has a column called "interval" which gives chromosome coordinates. In the 10x multiome example datasets, the interval column is all "NA" for the mitochondrial genes and thus these genes are excluded from the SummarizedExperiment.
> rowData(se)[grep(pattern = "MT-", x = rownames(rowData(se))),]
DataFrame with 13 rows and 5 columns
feature_type genome id interval name
<Rle> <Rle> <array> <array> <array>
MT-ND1 Gene Expression GRCh38 ENSG00000198888 NA MT-ND1
MT-ND2 Gene Expression GRCh38 ENSG00000198763 NA MT-ND2
MT-CO1 Gene Expression GRCh38 ENSG00000198804 NA MT-CO1
MT-CO2 Gene Expression GRCh38 ENSG00000198712 NA MT-CO2
MT-ATP8 Gene Expression GRCh38 ENSG00000228253 NA MT-ATP8
... ... ... ... ... ...
MT-ND4L Gene Expression GRCh38 ENSG00000212907 NA MT-ND4L
MT-ND4 Gene Expression GRCh38 ENSG00000198886 NA MT-ND4
MT-ND5 Gene Expression GRCh38 ENSG00000198786 NA MT-ND5
MT-ND6 Gene Expression GRCh38 ENSG00000198695 NA MT-ND6
MT-CYB Gene Expression GRCh38 ENSG00000198727 NA MT-CYB
I'm not sure why these genes dont have annotated intervals (that seems like a 10x problem?). Presumably you could try to manually edit this yourself.
Anything I'm missing @jgranja24 ?
Hi Ryan,
Thank you for the reply!
I wanted to filter some cells based on their mitochondrial gene counts. What do you think is the easiest way to do this? For example, can I import other data formats to ArchR such as h5ad or Seurat files?
For your information, I contacted 10x about this, and it looks like they don't include chrM gene coordinates as they exclude them for peak calling in cellranger.
In the ARC pipeline, we consider chrM as the non_nuclear_contigs (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/release-notes/references#GRCh38-2020-A):
non_nuclear_contigs: [\"chrM\"]
Here are some more detail about the 'non_nuclear_contigs' (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/advanced/references):
non_nuclear_contigs (Optional; list of strings) name(s) of contig(s) that do not have any chromatin structure, for example, mitochondria or plastids. For the GRCh38 assembly this would be ["chrM"]. These contigs are excluded from peak calling since the entire contig will be "open" due to a lack of chromatin structure.
As you can see, chrM is excluded from peak calling. Given that you are working with GEX+ATAC joint analysis, we did not include the full information of genes in chrM in GEX results as well.
Thanks for posting the information from 10x. I think the most straightforward way to do this would be to use the RNA data alone to identify barcodes that you want to remove based on MT- genes and then remove these cells from the ArchR project using subsetting. This will require some manual work on your end and is not part of a standard ArchR workflow at the moment but we will take this into account as this is a common filtering step in scRNA-seq
Hi Ryan,
Thanks, that would be fantastic. For now I’ll subset cells based on barcodes as you suggested.
Better late than never. We've done a big overhaul to the multiomic import functions and now make it possible to retain chrM genes. This is currently available on dev
and will be incorporated into the next stable release. To do this, you need to tell ArchR what the gene intervals are for these genes via the features
argument. This is because CellRanger doesnt provide those gene positions.
https://github.com/GreenleafLab/ArchR/blob/bb0d391cec77ba751410518beac824470629dd74/R/MultiModal.R#L19-L22
Thank you so much! This is fantastic