
Problems with scglue.models.integration_consistency in train.ipynb with few cells (~n=500).

Open · ilibarra opened this issue 1 year ago · 1 comment

When I subset both RNA and ATAC to ~500 cells in the training tutorial:

rna = rna[rna.obs_names.isin(rna.obs_names[:500]),:].copy()
atac = atac[atac.obs_names.isin(atac.obs_names[:500]),:].copy()
print(rna.shape, atac.shape)

Executing the following snippet then raises the error shown below it.

dx = scglue.models.integration_consistency(
    glue, {"rna": rna, "atac": atac}, guidance_hvf
)
dx
[INFO] integration_consistency: Using layer "counts" for modality "rna"
[INFO] integration_consistency: Selecting aggregation "sum" for modality "rna"
[INFO] integration_consistency: Selecting aggregation "sum" for modality "atac"
[INFO] integration_consistency: Selecting log-norm preprocessing for modality "rna"
[INFO] integration_consistency: Selecting log-norm preprocessing for modality "atac"
[INFO] get_metacells: Clustering metacells...
[WARNING] get_metacells: `faiss` is not installed, using `sklearn` instead... This might be slow with a large number of cells. Consider installing `faiss` following the guide from https://github.com/facebookresearch/faiss/blob/main/INSTALL.md
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 1 common metacells...
/home/rio/miniconda3/envs/scglue/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:197: UserWarning: Some cells have zero counts
  warn(UserWarning('Some cells have zero counts'))
/home/rio/miniconda3/envs/scglue/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/rio/miniconda3/envs/scglue/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/rio/miniconda3/envs/scglue/lib/python3.8/site-packages/scglue/data.py:625: RuntimeWarning: invalid value encountered in double_scalars
  ((X[s] * X[t]).mean() - mean[s] * mean[t]) / (std[s] * std[t])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/pandas/core/indexes/base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'sign'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [16], line 1
----> 1 dx = scglue.models.integration_consistency(
      2     glue, {"rna": rna, "atac": atac}, guidance_hvf
      3 )
      4 dx

File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/scglue/models/dx.py:99, in integration_consistency(model, adatas, graph, **kwargs)
     96     edgelist = nx.to_pandas_edgelist(corr)
     97     n_metas.append(n_meta)
     98     consistencies.append((
---> 99         edgelist["sign"] * edgelist["weight"] * edgelist["corr"]
    100     ).sum() / edgelist["weight"].sum())
    101 return pd.DataFrame({
    102     "n_meta": n_metas,
    103     "consistency": consistencies
    104 })

File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~/miniconda3/envs/scglue/lib/python3.8/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 'sign'

This does not happen with a larger number of cells, or with all 9,910 cells. I think it is related to the metacell assignment step. Could you explain how metacells are found, and how to avoid this error for very small samples where metacells may not be found? Thank you.

ilibarra, Oct 05 '22 13:10

Sorry for the late response. I tried but could not reproduce this error with subsampled cells.

In this particular example, the direct cause appears to be a missing "sign" edge attribute in the guidance graph guidance_hvf. Could you verify that the "sign" attribute exists on the graph edges?
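One way to run that check is sketched below. It assumes the guidance graph is a networkx graph, as the `nx.to_pandas_edgelist` call in the traceback suggests. The helper name `edges_missing_attr` and the toy graph are made up for illustration; in the notebook you would pass `guidance_hvf` instead.

```python
import networkx as nx


def edges_missing_attr(graph, attr):
    """Return the (source, target) pairs of edges that lack the attribute `attr`."""
    return [(u, v) for u, v, data in graph.edges(data=True) if attr not in data]


# Toy stand-in for guidance_hvf (hypothetical nodes and edges, not tutorial data).
toy = nx.MultiDiGraph()
toy.add_edge("GENE1", "chr1:100-200", weight=1.0, sign=1)
toy.add_edge("chr1:100-200", "GENE1", weight=1.0)  # "sign" deliberately omitted

missing = edges_missing_attr(toy, "sign")
print(missing)  # edges that would need a "sign" attribute
```

An empty list means every edge carries the attribute, in which case the cause is more likely elsewhere.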

As for the metacell assignment step, we use K-means clustering (with K = the number of metacells) to cluster cells from both modalities in the aligned embedding space. Within each K-means cluster, cells from each modality are pooled into a single metacell: if a cluster contains both RNA and ATAC cells, it produces a pair of RNA and ATAC metacells, but if it contains cells from only one modality, it yields an orphaned metacell. Finally, to compute the integration consistency score, we keep only the paired metacells (discarding orphaned ones) and compute the metacell correlation on them. If there are too few cells and the cells are not properly aligned, this step may produce many orphaned metacells and few paired ones, which would also be problematic. But I do not think that is the direct cause of the error here.
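To make the paired versus orphaned distinction concrete, here is a minimal sketch. It is not the scglue implementation: the label lists are made-up stand-ins for the K-means cluster assignment of each cell, and `split_metacells` is a hypothetical helper.

```python
def split_metacells(rna_labels, atac_labels):
    """Split cluster ids into paired (both modalities present) and orphaned (one only)."""
    rna_clusters, atac_clusters = set(rna_labels), set(atac_labels)
    paired = sorted(rna_clusters & atac_clusters)    # clusters with RNA and ATAC cells
    orphaned = sorted(rna_clusters ^ atac_clusters)  # clusters seen in one modality only
    return paired, orphaned


# Hypothetical cluster assignments, one entry per cell.
rna_labels = [0, 0, 1, 2]
atac_labels = [1, 3, 3]

paired, orphaned = split_metacells(rna_labels, atac_labels)
print(paired, orphaned)  # [1] [0, 2, 3]
```

Only the paired clusters would contribute to the metacell correlation, so a result like this one, with a single paired cluster, leaves very little to correlate.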

Jeff1995, Oct 11 '22 06:10