scenicplus icon indicating copy to clipboard operation
scenicplus copied to clipboard

Failing to integrate scATAC + scRNA cells (non multiome data). Error in rule prepare_GEX_ACC_non_multiome:

Open twinklevhatkar opened this issue 1 year ago • 3 comments

Hi! I am fairly new to the Scenic plus tool. I have been able to successfully install and get the snakemake workflow running. I have been working with non multiome data. My workflow runs just fine until it reaches "prepare_GEX_ACC_non_multiome:" module. Because it is non multiomics data, I know the cell barcodes will not match. I have the "GEX:annotation" metadata present for both my assays which helps integrating the two assays.

Here's the log file with my error:

Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 2 Rules claiming more threads will be scaled down. Job stats: job count


AUCell_direct 1 AUCell_extended 1 all 1 eGRN_direct 1 eGRN_extended 1 get_search_space 1 prepare_GEX_ACC_non_multiome 1 prepare_menr 1 region_to_gene 1 scplus_mudata 1 tf_to_gene 1 total 11

Select jobs to execute... Execute 1 jobs...

[Mon Jul 15 00:20:13 2024] localrule prepare_GEX_ACC_non_multiome: input: /home/scenicplus/outs/cistopic_obj_final.pkl, /home/scenicplus/outs/integrated_fibro_harmony_final.h5ad output: /home/scenicplus/outs/ACC_GEX.h5mu jobid: 2 reason: Missing output files: /home/scenicplus/outs/ACC_GEX.h5mu resources: tmpdir=/tmp

[Mon Jul 15 00:20:14 2024] Error in rule prepare_GEX_ACC_non_multiome: jobid: 2 input: /home/scenicplus/outs/cistopic_obj_final.pkl, /home/scenicplus/outs/integrated_fibro_harmony_final.h5ad output: /home/scenicplus/outs/ACC_GEX.h5mu shell:

        scenicplus prepare_data prepare_GEX_ACC                 --cisTopic_obj_fname /home/scenicplus/outs/cistopic_obj_final.pkl                 --GEX_anndata_fname /home/scenicplus/outs/integrated_fibro_harmony_final.h5ad                 --out_file /home/scenicplus/outs/ACC_GEX.h5mu                 --bc_transform_func                  --is_not_multiome                 --key_to_group_by GEX:annotation                 --nr_cells_per_metacells 10
        
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2024-07-15T002013.095171.snakemake.log WorkflowError: At least one job did not complete successfully.

######## My log file doesn't indicate where the issue is. Any help would be appreciated. I am not sure what I'm missing here?!

Thanks!

2024-07-15T002013.095171.snakemake.log

twinklevhatkar avatar Jul 15 '24 17:07 twinklevhatkar

Hi @twinklevhatkar

Could you run

scenicplus prepare_data prepare_GEX_ACC \
                 --cisTopic_obj_fname /home/scenicplus/outs/cistopic_obj_final.pkl \
                 --GEX_anndata_fname /home/scenicplus/outs/integrated_fibro_harmony_final.h5ad \
                 --out_file /home/scenicplus/outs/ACC_GEX.h5mu \
                 --bc_transform_func \
                  --is_not_multiome \
                 --key_to_group_by GEX:annotation \
                 --nr_cells_per_metacells 10

This might reveal the underlying error.

Best,

Seppe

SeppeDeWinter avatar Jul 29 '24 14:07 SeppeDeWinter

Hello @SeppeDeWinter,

I have encountered the exactly same issue.

The log file of the command is:

2025-04-16 15:46:47,472 SCENIC+      INFO     Reading cisTopic object.
2025-04-16 15:47:02,294 SCENIC+      INFO     Reading gene expression AnnData.
2025-04-16 15:47:06,626 cisTopic     INFO     Imputing region accessibility
2025-04-16 15:47:06,626 cisTopic     INFO     Impute region accessibility for regions 0-20000
2025-04-16 15:47:18,123 cisTopic     INFO     Impute region accessibility for regions 20000-40000
2025-04-16 15:47:29,669 cisTopic     INFO     Impute region accessibility for regions 40000-60000
2025-04-16 15:47:41,185 cisTopic     INFO     Impute region accessibility for regions 60000-80000
2025-04-16 15:47:52,678 cisTopic     INFO     Impute region accessibility for regions 80000-100000
2025-04-16 15:48:04,522 cisTopic     INFO     Impute region accessibility for regions 100000-120000
2025-04-16 15:48:16,117 cisTopic     INFO     Impute region accessibility for regions 120000-140000
2025-04-16 15:48:27,730 cisTopic     INFO     Impute region accessibility for regions 140000-160000
2025-04-16 15:48:39,328 cisTopic     INFO     Impute region accessibility for regions 160000-180000
2025-04-16 15:48:50,922 cisTopic     INFO     Impute region accessibility for regions 180000-200000
2025-04-16 15:49:02,727 cisTopic     INFO     Impute region accessibility for regions 200000-220000
2025-04-16 15:49:14,329 cisTopic     INFO     Impute region accessibility for regions 220000-240000
2025-04-16 15:49:26,006 cisTopic     INFO     Impute region accessibility for regions 240000-260000
2025-04-16 15:49:35,629 cisTopic     INFO     Done!
2025-04-16 15:49:35,700 Ingesting non-multiome data INFO     Following annotations were found in both assays under key Major_clust:
	Astrocyte, Oligodendrocyte, Inh_NDNF, Inh_PVALB, Exc_L6CT, Exc_L3_5IT, Exc_L2/3IT, Exc_L5ET, Microglia, Exc_L5_6NP, Inh_LAMP5, Inh_VIP, Inh_SST, OPC, Exc_L6IT, Exc_L6b.
Keeping 109682 cells for RNA and 91765 for ATAC.
2025-04-16 15:54:45,488 Ingesting non-multiome data INFO     Automatically set `nr_metacells` to: Astrocyte: 644, Exc_L2/3IT: 1614, Exc_L3_5IT: 1366, Exc_L5ET: 66, Exc_L5_6NP: 86, Exc_L6CT: 186, Exc_L6IT: 370, Exc_L6b: 146, Inh_LAMP5: 258, Inh_NDNF: 264, Inh_PVALB: 556, Inh_SST: 468, Inh_VIP: 368, Microglia: 700, OPC: 700, Oligodendrocyte: 10376
2025-04-16 15:54:45,488 Ingesting non-multiome data INFO     Generating pseudo multi-ome data
[Wed Apr 16 16:18:42 2025]
Error in rule prepare_GEX_ACC_non_multiome:
    jobid: 2
    input: /home/zhangfy/data_projects/SCENIC+/data_out/cisTopicObject_annotated.pkl, /home/zhangfy/data_projects/SCENIC+/data_in/human_snRNA/adata_new.h5ad
    output: ACC_GEX.h5mu
    shell:
        
            scenicplus prepare_data prepare_GEX_ACC                 --cisTopic_obj_fname /home/zhangfy/data_projects/SCENIC+/data_out/cisTopicObject_annotated.pkl                 --GEX_anndata_fname /home/zhangfy/data_projects/SCENIC+/data_in/human_snRNA/adata_new.h5ad                 --out_file ACC_GEX.h5mu                 --bc_transform_func "lambda x: f'{x}'"                 --is_not_multiome                 --key_to_group_by Major_clust                 --nr_cells_per_metacells 10
            
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Btw, The h5ad file is created according to the guidelines. But the mtx.gz, features.gz, barcode.gz files are exported from Seurat. There wasn't any problem with the scanpy procedure. scripts are:

import scanpy as sc
import pandas as pd
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

import scanpy as sc
adata = sc.read_10x_mtx(
    "/home/zhangfy/data_projects/SCENIC+/data_in/feature_bc_matrix",
    var_names = "gene_symbols"
)

adata.var_names_make_unique()

cell_data = pd.read_csv("/home/zhangfy/data_projects/SCENIC+/data_in/human_snRNA/RNA_anno_hs.csv", index_col = 0)
# I have make sure that the barcodes from cell_data and adata are the same:
if adata.obs_names.equals(cell_data.index):
    print("The cell barcodes in adata and cell_data are consistent.")
else:
    print("The cell barcodes in adata and cell_data are NOT consistent.")

adata.obs = cell_data

#normalize the data
adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)

sc.tl.pca(adata)
sc.pl.pca(adata, color = "Major_clust")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color = "Major_clust")

adata.write("/home/zhangfy/data_projects/SCENIC+/data_in/human_snRNA/adata_new.h5ad")

Also, the cisTopic object is created using:

# Create cisTopic object
from pycisTopic.cistopic_class import *
sparse_matrix = mmread(projDir+"data_in/sparse_matrix.mtx").tocsr()
region_names = pd.read_csv(projDir+"data_in/region_names.txt")['x'].tolist() #256366√
cell_names = pd.read_csv(projDir+"data_in/cell_names.txt")['x'].tolist() #91765√
cistopic_obj = create_cistopic_object(
    fragment_matrix=sparse_matrix,  
    cell_names=cell_names,        
    region_names=region_names     
)
# Adding cell information
cell_data =  pd.read_csv(projDir+'data_in/metadata.txt', sep=' ')
cell_data.columns
cistopic_obj.add_cell_data(cell_data)


#the rest are the same as the guidelines

Thank you!

Best, Feiyang

JerryZhang-1222 avatar Apr 16 '25 09:04 JerryZhang-1222

I see what happened.

in the .yml file, one need to specify the output files in the section output_data by not just addressing the name of the file, but also the path to the file. for example:

(WRONG):combined_GEX_ACC_mudata: "ACC_GEX.h5mu"
(CORRECT):combined_GEX_ACC_mudata: "<path_to_the_file>/ACC_GEX.h5mu"

I think this might be the cause of the problem.

Anyway, thank you for SCENIC+.

Best, Feiyang.

JerryZhang-1222 avatar Apr 16 '25 15:04 JerryZhang-1222