ArchR icon indicating copy to clipboard operation
ArchR copied to clipboard

Mitochondrial genes removed at import10xFeatureMatrix

Open kosekijkk opened this issue 3 years ago • 5 comments

Hi Jeffrey and Ryan,

I was testing the multiome pipeline following the tutorial below. https://greenleaflab.github.io/ArchR_2020/Ex-Analyze-Multiome.html

However, I found out that at the import10xFeatureMatrix function, all the mitochondrial genes were removed.

seRNA <- import10xFeatureMatrix(input = input_files, names = input_names)
Importing Feature Matrix 1 of 1
sum(str_detect(rownames(seRNA),'MT-'))
[1] 0

Is this the default behavior? I have confirmed that they exist in the cellranger output like below.

Screenshot 2021-03-02 210844

Thanks!

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] openxlsx_4.2.3                     ggrepel_0.9.1                     
 [3] circlize_0.4.12                    ComplexHeatmap_2.6.2              
 [5] gtable_0.3.0                       harmony_1.0                       
 [7] Rcpp_1.0.6                         uwot_0.1.10                       
 [9] gridExtra_2.3                      nabor_0.5.0                       
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.0 SeuratDisk_0.0.0.9013             
[13] qusage_2.24.0                      limma_3.46.0                      
[15] stringr_1.4.0                      BSgenome.Hsapiens.UCSC.hg38_1.4.3 
[17] BSgenome_1.58.0                    rtracklayer_1.50.0                
[19] Biostrings_2.58.0                  XVector_0.30.0                    
[21] Seurat_3.2.3                       ArchR_1.0.1                       
[23] magrittr_2.0.1                     rhdf5_2.34.0                      
[25] Matrix_1.2-18                      data.table_1.13.6                 
[27] SummarizedExperiment_1.20.0        Biobase_2.50.0                    
[29] GenomicRanges_1.42.0               GenomeInfoDb_1.26.2               
[31] IRanges_2.24.1                     S4Vectors_0.28.1                  
[33] BiocGenerics_0.36.0                MatrixGenerics_1.2.0              
[35] matrixStats_0.57.0                 ggplot2_3.3.3                     

loaded via a namespace (and not attached):
  [1] plyr_1.8.6               igraph_1.2.6             lazyeval_0.2.2          
  [4] splines_4.0.3            BiocParallel_1.24.1      listenv_0.8.0           
  [7] scattermore_0.7          digest_0.6.27            htmltools_0.5.1.1       
 [10] fansi_0.4.2              tensor_1.5               cluster_2.1.0           
 [13] ROCR_1.0-11              globals_0.14.0           colorspace_2.0-0        
 [16] xfun_0.20                dplyr_1.0.3              crayon_1.3.4            
 [19] RCurl_1.98-1.2           jsonlite_1.7.2           spatstat_1.64-1         
 [22] spatstat.data_1.7-0      survival_3.2-7           zoo_1.8-8               
 [25] glue_1.4.2               polyclip_1.10-0          emmeans_1.5.3           
 [28] zlibbioc_1.36.0          leiden_0.3.6             GetoptLong_1.0.5        
 [31] DelayedArray_0.16.1      Rhdf5lib_1.12.0          future.apply_1.7.0      
 [34] shape_1.4.5              abind_1.4-5              scales_1.1.1            
 [37] mvtnorm_1.1-1            DBI_1.1.1                miniUI_0.1.1.1          
 [40] viridisLite_0.3.0        xtable_1.8-4             clue_0.3-58             
 [43] reticulate_1.18          bit_4.0.4                rsvd_1.0.3              
 [46] htmlwidgets_1.5.3        httr_1.4.2               RColorBrewer_1.1-2      
 [49] ellipsis_0.3.1           ica_1.0-2                pkgconfig_2.0.3         
 [52] XML_3.99-0.5             deldir_0.2-9             tidyselect_1.1.0        
 [55] rlang_0.4.10             reshape2_1.4.4           later_1.1.0.1           
 [58] munsell_0.5.0            tools_4.0.3              cli_2.2.0               
 [61] generics_0.1.0           ggridges_0.5.3           evaluate_0.14           
 [64] fastmap_1.0.1            yaml_2.2.1               goftest_1.2-2           
 [67] bit64_4.0.5              knitr_1.30               fitdistrplus_1.1-3      
 [70] zip_2.1.1                purrr_0.3.4              RANN_2.6.1              
 [73] pbapply_1.4-3            future_1.21.0            nlme_3.1-150            
 [76] mime_0.9                 hdf5r_1.3.3              compiler_4.0.3          
 [79] rstudioapi_0.13          plotly_4.9.3             png_0.1-7               
 [82] spatstat.utils_2.0-0     tibble_3.0.5             stringi_1.5.3           
 [85] lattice_0.20-41          fftw_1.0-6               vctrs_0.3.6             
 [88] pillar_1.4.7             lifecycle_0.2.0          rhdf5filters_1.2.0      
 [91] lmtest_0.9-38            GlobalOptions_0.1.2      estimability_1.3        
 [94] RcppAnnoy_0.0.18         cowplot_1.1.1            bitops_1.0-6            
 [97] irlba_2.3.3              httpuv_1.5.5             patchwork_1.1.1         
[100] R6_2.5.0                 promises_1.1.1           KernSmooth_2.23-18      
[103] parallelly_1.23.0        codetools_0.2-18         MASS_7.3-53             
[106] assertthat_0.2.1         rjson_0.2.20             withr_2.4.0             
[109] GenomicAlignments_1.26.0 sctransform_0.3.2        Rsamtools_2.6.0         
[112] GenomeInfoDbData_1.2.4   mgcv_1.8-33              rpart_4.1-15            
[115] tidyr_1.1.2              rmarkdown_2.6            Cairo_1.5-12.2          
[118] Rtsne_0.15               shiny_1.5.0              tinytex_0.29

kosekijkk avatar Mar 03 '21 02:03 kosekijkk

This is the expected behavior and I'm not sure how to fix this. This happens at this line of code: https://github.com/GreenleafLab/ArchR/blob/2f022a448d8248a0f9afb33419bcbaeafe7731c0/R/MultiModal.R#L89

The feature matrix in the hdf5 file (accessed via features <- h5read(featureMatrix, "/matrix/features")) has a column called "interval" which gives chromosome coordinates. In the 10x multiome example datasets, the interval column is all "NA" for the mitochondrial genes and thus these genes are excluded from the SummarizedExperiment.

> rowData(se)[grep(pattern = "MT-", x = rownames(rowData(se))),]
DataFrame with 13 rows and 5 columns
           feature_type genome              id interval    name
                  <Rle>  <Rle>         <array>  <array> <array>
MT-ND1  Gene Expression GRCh38 ENSG00000198888       NA  MT-ND1
MT-ND2  Gene Expression GRCh38 ENSG00000198763       NA  MT-ND2
MT-CO1  Gene Expression GRCh38 ENSG00000198804       NA  MT-CO1
MT-CO2  Gene Expression GRCh38 ENSG00000198712       NA  MT-CO2
MT-ATP8 Gene Expression GRCh38 ENSG00000228253       NA MT-ATP8
...                 ...    ...             ...      ...     ...
MT-ND4L Gene Expression GRCh38 ENSG00000212907       NA MT-ND4L
MT-ND4  Gene Expression GRCh38 ENSG00000198886       NA  MT-ND4
MT-ND5  Gene Expression GRCh38 ENSG00000198786       NA  MT-ND5
MT-ND6  Gene Expression GRCh38 ENSG00000198695       NA  MT-ND6
MT-CYB  Gene Expression GRCh38 ENSG00000198727       NA  MT-CYB

I'm not sure why these genes dont have annotated intervals (that seems like a 10x problem?). Presumably you could try to manually edit this yourself.

Anything I'm missing @jgranja24 ?

rcorces avatar Mar 03 '21 05:03 rcorces

Hi Ryan,

Thank you for the reply!

I wanted to filter some cells based on their mitochondrial gene counts. What do you think is the easiest way to do this? For example, can I import other data formats to ArchR such as h5ad or Seurat files?

kosekijkk avatar Mar 03 '21 16:03 kosekijkk

For your information, I contacted 10x about this, and it looks like they don't include chrM gene coordinates as they exclude them for peak calling in cellranger.

In the ARC pipeline, we consider chrM as the non_nuclear_contigs (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/release-notes/references#GRCh38-2020-A):
non_nuclear_contigs: [\"chrM\"]

Here are some more detail about the 'non_nuclear_contigs' (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/advanced/references): 

non_nuclear_contigs	(Optional; list of strings) name(s) of contig(s) that do not have any chromatin structure, for example, mitochondria or plastids. For the GRCh38 assembly this would be ["chrM"]. These contigs are excluded from peak calling since the entire contig will be "open" due to a lack of chromatin structure.

As you can see, chrM is excluded from peak calling. Given that you are working with GEX+ATAC joint analysis, we did not include the full information of genes in chrM in GEX results as well.

kosekijkk avatar Mar 03 '21 20:03 kosekijkk

Thanks for posting the information from 10x. I think the most straightforward way to do this would be to use the RNA data alone to identify barcodes that you want to remove based on MT- genes and then remove these cells from the ArchR project using subsetting. This will require some manual work on your end and is not part of a standard ArchR workflow at the moment but we will take this into account as this is a common filtering step in scRNA-seq

rcorces avatar Mar 03 '21 20:03 rcorces

Hi Ryan,

Thanks, that would be fantastic. For now I’ll subset cells based on barcodes as you suggested.

kosekijkk avatar Mar 03 '21 21:03 kosekijkk

Better late than never. We've done a big overhaul to the multiomic import functions and now make it possible to retain chrM genes. This is currently available on dev and will be incorporated into the next stable release. To do this, you need to tell ArchR what the gene intervals are for these genes via the features argument. This is because CellRanger doesnt provide those gene positions.

https://github.com/GreenleafLab/ArchR/blob/bb0d391cec77ba751410518beac824470629dd74/R/MultiModal.R#L19-L22

rcorces avatar Nov 09 '22 17:11 rcorces

Thank you so much! This is fantastic

kosekijkk avatar Nov 09 '22 19:11 kosekijkk