MultiMAP
Different feature dim numbers after PCA in example script?
Hi there,
I have read the preprint; very nice work.
I am trying to run the example script in the project, and I noticed something about the inputs to MultiMAP.Integration:
adata = MultiMAP.Integration([rna, atac_genes], ['X_pca', 'X_lsi'])
rna.obsm['X_pca'] has shape (4382, 50), while atac_genes.obsm['X_lsi'] has shape (3166, 49). atac_genes.obsm['X_lsi'] is the output of MultiMAP.TFIDF_LSI() in __init__.py, which in turn calls tfidf() in matrix.py:
MultiMAP.TFIDF_LSI(atac_peaks)
atac_genes.obsm['X_lsi'] = atac_peaks.obsm['X_lsi'].copy()
I then checked matrix.py, and I think the dimension of 49 might be due to discarding the first column of the sklearn.decomposition.TruncatedSVD() output?
# n_components passed to here is 50
def tfidf(X, n_components, binarize=True, random_state=0):
    from sklearn.feature_extraction.text import TfidfTransformer

    sc_count = np.copy(X)
    if binarize:
        sc_count = np.where(sc_count < 1, sc_count, 1)

    tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
    normed_count = tfidf.fit_transform(sc_count)

    lsi = sklearn.decomposition.TruncatedSVD(n_components=n_components, random_state=random_state)
    lsi_r = lsi.fit_transform(normed_count)

    # Here↓↓↓↓
    X_lsi = lsi_r[:, 1:]
    return X_lsi
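To check my reading of the slicing, here is a minimal sketch on synthetic data. It uses a plain NumPy SVD as a stand-in for TruncatedSVD and skips the TF-IDF step, so it only illustrates the shape arithmetic and why the first component tends to track depth. The matrix sizes and the `depth` variable are made up for illustration and are not taken from the example script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binarized "cells x peaks" matrix (sizes are hypothetical).
n_cells, n_peaks, n_components = 300, 500, 50

# Hypothetical per-cell probability that a peak is open; mimics varying depth.
depth = rng.uniform(0.05, 0.5, size=n_cells)
X = (rng.random((n_cells, n_peaks)) < depth[:, None]).astype(float)

# Uncentered SVD, used here as a stand-in for TruncatedSVD
# (which also does not center the data before decomposing).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
lsi_r = U[:, :n_components] * S[:n_components]

# Same slicing as in tfidf(): drop the first component.
X_lsi = lsi_r[:, 1:]

# Correlation between the dropped component and per-cell total counts.
corr = np.corrcoef(lsi_r[:, 0], X.sum(axis=1))[0, 1]

print(lsi_r.shape, X_lsi.shape)
print(abs(corr))
```

With n_components = 50, the slice leaves 49 columns, which matches the shape I see in atac_genes.obsm['X_lsi'].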
Is column 0 discarded in order to remove the first component, which is usually strongly correlated with sequencing depth? If so, the two inputs to MultiMAP.Integration() have 50 and 49 dimensions respectively. The function still runs normally and returns a result of shape (7548, 2), but is it okay to do this? From reading the preprint, my impression was that the two datasets to be integrated should have the same number of dimensions after reduction, because inter-dataset point distances need to be calculated. Please correct me if my understanding is wrong.
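To make the concern concrete, here is a minimal sketch of what I had in mind. It is not a claim about how MultiMAP computes distances internally; `pairwise_euclidean` is a hypothetical helper, and the arrays are small random stand-ins for the 50-dimensional and 49-dimensional embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for rna.obsm['X_pca'] and atac_genes.obsm['X_lsi'].
rna_pca = rng.normal(size=(5, 50))
atac_lsi = rng.normal(size=(4, 49))


def pairwise_euclidean(A, B):
    """Hypothetical helper: Euclidean distance between every row of A and B."""
    if A.shape[1] != B.shape[1]:
        raise ValueError(
            f"dimension mismatch: {A.shape[1]} vs {B.shape[1]}"
        )
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Distances within each dataset are well defined in that dataset's own space.
within_rna = pairwise_euclidean(rna_pca, rna_pca)
within_atac = pairwise_euclidean(atac_lsi, atac_lsi)

# A direct distance across the two embeddings is not defined.
try:
    pairwise_euclidean(rna_pca, atac_lsi)
    cross_ok = True
except ValueError:
    cross_ok = False

print(within_rna.shape, within_atac.shape, cross_ok)
```

So if the cross-dataset distances were taken directly between these two embeddings, I would expect an error rather than a result, which is why I am unsure how the 50 and 49 dimensional inputs are being used.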