Repeated column names in dataframe cause error during category validation

Open HelloWorldLTY opened this issue 1 year ago • 8 comments

Please make sure these conditions are met

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of anndata.
  • [X] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

    def _remove_unused_categories(
        self, df_full: pd.DataFrame, df_sub: pd.DataFrame, uns: dict[str, Any]
    ):
        for k in df_full:
            if not isinstance(df_full[k].dtype, pd.CategoricalDtype):
                continue

Traceback:


I found a bug in the implementation of the updated anndata, in these lines:

https://github.com/scverse/anndata/blob/f4ec10e62b924c535f7fcd51a94d9991552ced31/anndata/_core/anndata.py#L1180C1-L1185C25

It seems that a pandas DataFrame exposes dtypes rather than dtype, so this check fails.

https://stackoverflow.com/questions/71784723/attributeerror-dataframe-object-has-no-attribute-dtype-appeared-suddenly

As a result, I hit an error here.

Versions

Latest version

HelloWorldLTY avatar Jan 23 '24 22:01 HelloWorldLTY

Thanks for the report.

Do you have an example of code that can trigger this, or at least a traceback?

Because k is a key of the dataframe in question, I believe df[k] should always be a Series. Is there a case where this would fail?

df: pd.DataFrame

for k in df:
    assert isinstance(df[k], pd.Series)

ivirshup avatar Jan 24 '24 18:01 ivirshup

I think this error happens when var names or other names are duplicated, since df[k] then returns a DataFrame. I am not sure whether duplicated names are disallowed, since they only trigger a warning rather than an error.

HelloWorldLTY avatar Jan 24 '24 18:01 HelloWorldLTY

Ah, that's right! But this only happens if there are duplicates in adata.{obs,var}.columns, not adata.{obs,var}.index. I do really dislike that pandas lets you do this.
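A minimal sketch of the behaviour described above, using plain pandas only. The frame and its column names are made up for illustration; nothing here is anndata code.

```python
import pandas as pd

# Illustrative frame: two columns share the name "a", which pandas permits.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

dup = df["a"]   # repeated label: selects both columns, so this is a DataFrame
uniq = df["b"]  # unique label: a Series, as the loop in question expects

print(type(dup).__name__)     # DataFrame
print(type(uniq).__name__)    # Series
print(hasattr(dup, "dtype"))  # False: a DataFrame has .dtypes, not .dtype
```

So `df_full[k].dtype` only works while every column label is unique.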

Having an object like this will probably give you problems later (like with IO: https://github.com/scverse/anndata/issues/884).

I'm a little split between making this work, or throwing an informative error here.

ivirshup avatar Jan 25 '24 15:01 ivirshup

Thanks. I will avoid such cases in the future. I would prefer an error, plus a call to var_names_make_unique() in the reading step.
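One way the "informative error" option could look, as a rough sketch. `check_unique_columns` is a hypothetical helper written for this comment, not an existing anndata function, and the error message wording is an assumption.

```python
import pandas as pd


def check_unique_columns(df: pd.DataFrame, attr: str = "obs") -> None:
    """Hypothetical helper: raise if `df` has repeated column names."""
    # columns.duplicated() marks every repeat after the first occurrence.
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        raise ValueError(
            f"Column names in `{attr}` must be unique; found duplicates: {dupes}"
        )


# Usage on an illustrative frame with a repeated column name.
bad = pd.DataFrame([[1, 2]], columns=["x", "x"])
try:
    check_unique_columns(bad, attr="var")
except ValueError as err:
    print(err)
```

Running a check like this before the category loop would replace the AttributeError with a message that names the offending columns.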

HelloWorldLTY avatar Jan 25 '24 15:01 HelloWorldLTY

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions[bot] avatar Mar 26 '24 02:03 github-actions[bot]

I see this as well! I accidentally had a repeated column in adata.var, and when I try to slice adata as in adata[logical_array], I get

AttributeError: 'DataFrame' object has no attribute 'dtype'

sjfleming avatar Apr 13 '24 06:04 sjfleming

This also happens with duplicated .obs columns, which is a common mistake when updating adata.obs with concatenation.

adata = sc.datasets.blobs()
new_col = adata.obs['blobs'].astype(int) + 1
adata.obs = pd.concat([adata.obs, new_col], axis=1)
adata[:, adata.var_names]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_189206/1060512404.py in ?()
      1 new_col = adata.obs['blobs'].astype(int) + 1
      2 adata.obs = pd.concat([adata.obs, new_col], axis=1)
      3 
----> 4 adata[:, adata.var_names]

/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, index)
   1083     def __getitem__(self, index: Index) -> AnnData:
   1084         """Returns a sliced view of the object."""
   1085         oidx, vidx = self._normalize_indices(index)
-> 1086         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)

/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    265     ):
    266         if asview:
    267             if not isinstance(X, AnnData):
    268                 raise ValueError("`X` has to be an AnnData object.")
--> 269             self._init_as_view(X, oidx, vidx)
    270         else:
    271             self._init_as_actual(
    272                 X=X,

/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, adata_ref, oidx, vidx)
    321         self._obsp = adata_ref.obsp._view(self, oidx)
    322         self._varp = adata_ref.varp._view(self, vidx)
    323         # fix categories
    324         uns = copy(adata_ref._uns)
--> 325         self._remove_unused_categories(adata_ref.obs, obs_sub, uns)
    326         self._remove_unused_categories(adata_ref.var, var_sub, uns)
    327         # set attributes
    328         self._obs = DataFrameView(obs_sub, view_args=(self, "obs"))

/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, df_full, df_sub, uns)
   1088     def _remove_unused_categories(
   1089         self, df_full: pd.DataFrame, df_sub: pd.DataFrame, uns: dict[str, Any]
   1090     ):
   1091         for k in df_full:
-> 1092             if not isinstance(df_full[k].dtype, pd.CategoricalDtype):
   1093                 continue
   1094             all_categories = df_full[k].cat.categories
   1095             with pd.option_context("mode.chained_assignment", None):

/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
   6295             and name not in self._accessors
   6296             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'dtype'
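A possible workaround for the concatenation case, sketched with plain pandas so it runs without scanpy. The small `obs` frame stands in for `adata.obs`, and the name `blobs_plus_one` is made up for illustration.

```python
import pandas as pd

# Stand-in for adata.obs with a single "blobs" column.
obs = pd.DataFrame({"blobs": [0, 1, 2]})
new_col = obs["blobs"] + 1  # the derived Series is still named "blobs"

# Concatenating as-is produces two columns called "blobs".
dup_obs = pd.concat([obs, new_col], axis=1)
print(list(dup_obs.columns))

# Option 1: give the new column its own name before concatenating.
fixed = pd.concat([obs, new_col.rename("blobs_plus_one")], axis=1)

# Option 2: repair an existing frame by keeping the first column per label.
deduped = dup_obs.loc[:, ~dup_obs.columns.duplicated()]
```

Option 1 keeps both columns; option 2 silently drops the later one, so it is only appropriate when the repeat is redundant.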

emdann avatar May 24 '24 23:05 emdann

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions[bot] avatar Jul 24 '24 02:07 github-actions[bot]