simba.read_embeddind() returns three embeddings instead of four for dual-omics 10X PBMCs dataset

Open AmirTEbi opened this issue 9 months ago • 0 comments

Hi, I replicated the SIMBA tutorial on the Multiomics integration of the 10x PBMC datasets and I got three embeddings instead of four. Specifically, I got embeddings for "C", "G", and "P" while in the tutorial the read_embedding method returns embeddings for "C", "C2", "G", and "P". I used datasets.multiome_10xpbmc10k method for loading the data. I don't know whether it's a problem with my implementation or SIMBA.

Also, I want to know if there is a way to evaluate SIMBA on unseen datasets. For instance, SIMBA is trained on the training data, and embeddings are generated based on the unseen test data from the trained model.

My OS is Ubuntu 22.04.3 LTS (GNU/Linux 6.8.0-40-generic x86_64). Below, I provided the code and the outputs that are different from the tutorial. Others are the same as the tutorial.

import os
import simba as si

dict_adata = si.datasets.multiome_10xpbmc10k()

adata_CP = dict_adata['atac']
adata_CG = dict_adata['rna']

adata_CP.obs.index = adata_CP.obs.index + '_atac'
adata_CG.obs.index = adata_CG.obs.index + '_rna'

adata_CG.obs.head()
adata_CP.obs.head()

si.pp.filter_peaks(adata_CP,min_n_cells=3)
si.pp.cal_qc_atac(adata_CP)

si.pl.violin(adata_CP,list_obs=['n_counts','n_peaks','pct_peaks'], list_var=['n_cells'])
si.pl.hist(adata_CP,list_obs=['n_counts','n_peaks','pct_peaks'], log=True, list_var=['n_cells'])

si.pp.pca(adata_CP, n_components=50)
si.pl.pca_variance_ratio(adata_CP)
si.pp.select_pcs(adata_CP,n_pcs=40)
si.pp.select_pcs_features(adata_CP)

si.pp.filter_genes(adata_CG,min_n_cells=3)
si.pp.cal_qc_rna(adata_CG)
si.pp.normalize(adata_CG,method='lib_size')
si.pp.log_transform(adata_CG)
si.pp.select_variable_genes(adata_CG, n_top_genes=4000)
si.pl.variable_genes(adata_CG,show_texts=True)
si.tl.discretize(adata_CG,n_bins=5)
si.pl.discretize(adata_CG,kde=False)
adata_CG_atac = si.tl.gene_scores(adata_CP,genome='hg38',use_gene_weigt=True, use_top_pcs=True)

adata_CrnaCatac = si.tl.infer_edges(adata_CG, adata_CG_atac, n_components=15, k=15)
si.tl.gen_graph(list_CP=[adata_CP],
                list_CG=[adata_CG],
                list_CC=[adata_CrnaCatac],
                copy=False,
                use_highly_variable=True,
                use_top_pcs=True,
                dirname='graph0')

Output:

#shared features: 3167
Performing randomized SVD ...
Searching for mutual nearest neighbors ...
99391 edges are selected

si.tl.gen_graph(list_CP=[adata_CP],
                list_CG=[adata_CG],
                list_CC=[adata_CrnaCatac],
                copy=False,
                use_highly_variable=True,
                use_top_pcs=True,
                dirname='graph0')

Output

`simba` does not exist in anndata 0 in `list_CP`.`.X` is being used instead.
relation0: source: C, destination: P
#edges: 79795295
relation1: source: C2, destination: G
#edges: 481504
relation2: source: C2, destination: G
#edges: 1097813
relation3: source: C2, destination: G
#edges: 1328809
relation4: source: C2, destination: G
#edges: 935647
relation5: source: C2, destination: G
#edges: 229885
relation6: source: C2, destination: C
#edges: 99391
Total number of edges: 83968344
Writing graph file "pbg_graph.txt" to "./result_simba/pbg/graph0" ...
Finished.

si.settings.pbg_params

Output:

{'entity_path': './result_simba/pbg/graph0/input/entity',
 'edge_paths': ['./result_simba/pbg/graph0/input/edge'],
 'checkpoint_path': 'result_simba/pbg/graph0/model',
 'entities': {'C': {'num_partitions': 1},
  'C2': {'num_partitions': 1},
  'G': {'num_partitions': 1},
  'P': {'num_partitions': 1}},
 'relations': [{'name': 'r0',
   'lhs': 'C',
   'rhs': 'P',
   'operator': 'none',
   'weight': 1.0},
  {'name': 'r1', 'lhs': 'C2', 'rhs': 'G', 'operator': 'none', 'weight': 1.0},
  {'name': 'r2', 'lhs': 'C2', 'rhs': 'G', 'operator': 'none', 'weight': 2.0},
  {'name': 'r3', 'lhs': 'C2', 'rhs': 'G', 'operator': 'none', 'weight': 3.0},
  {'name': 'r4', 'lhs': 'C2', 'rhs': 'G', 'operator': 'none', 'weight': 4.0},
  {'name': 'r5', 'lhs': 'C2', 'rhs': 'G', 'operator': 'none', 'weight': 5.0},
  {'name': 'r6', 'lhs': 'C2', 'rhs': 'C', 'operator': 'none', 'weight': 10.0}],
 'dynamic_relations': False,
 'dimension': 50,
 'global_emb': False,
 'comparator': 'dot',
 'num_epochs': 10,
 'workers': 4,
 'num_batch_negs': 50,
...
 'wd_interval': 50,
 'eval_fraction': 0.05,
 'eval_num_batch_negs': 50,
 'eval_num_uniform_negs': 50,
 'checkpoint_preservation_interval': None}

si.tl.pbg_train(auto_wd=True, save_wd=True, output='model')

And the embedding part:

dict_embs = si.read_embedding()
dict_embs

Output:

{'C': AnnData object with n_obs × n_vars = 9631 × 50,
 'G': AnnData object with n_obs × n_vars = 2000 × 50,
 'P': AnnData object with n_obs × n_vars = 59099 × 50}

Thank you.

Apr 09 '25 08:04 AmirTEbi