Repeated column names in dataframe cause error during category validation
Please make sure these conditions are met
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of anndata.
- [X] (optional) I have confirmed this bug exists on the master branch of anndata.
Report
Code:
def _remove_unused_categories(
    self, df_full: pd.DataFrame, df_sub: pd.DataFrame, uns: dict[str, Any]
):
    for k in df_full:
        if not isinstance(df_full[k].dtype, pd.CategoricalDtype):
            continue
Traceback:
I found a bug in the implementation of the updated anndata, in these lines:
https://github.com/scverse/anndata/blob/f4ec10e62b924c535f7fcd51a94d9991552ced31/anndata/_core/anndata.py#L1180C1-L1185C25
It seems that a pandas DataFrame exposes dtypes rather than dtype, so the dtype check in those lines fails:
https://stackoverflow.com/questions/71784723/attributeerror-dataframe-object-has-no-attribute-dtype-appeared-suddenly
As a result, I hit an error here.
Versions
Latest version
Thanks for the report.
Do you have an example of code that can trigger this, or at least a traceback?
Because k is a key of the dataframe in question, I believe df[k] should always be a Series. Is there a case where this would fail?
df: pd.DataFrame
for k in df:
    assert isinstance(df[k], pd.Series)
I think this error happens when the var names or other names are duplicated, since df[k] is then a DataFrame. I am not sure whether duplicated names are disallowed, since they only produce a warning rather than an error.
Ah, that's right! But this will be only if there are duplicates in adata.{obs,var}.columns, not adata.{obs,var}.index. I do really dislike that pandas lets you do this.
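For illustration, here is a minimal pandas-only sketch of the behaviour discussed above. The toy frames and variable names are made up for this example and are not taken from anndata. It shows that a repeated column label makes df[k] a DataFrame (which has no dtype attribute), while a repeated index label does not.

```python
import pandas as pd

# Hypothetical toy frame with the column label "a" repeated.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# Iterating over a DataFrame yields its column labels, duplicates included.
labels = list(df)

# A unique label selects a Series, which has a `.dtype`.
unique_is_series = isinstance(df["b"], pd.Series)

# A repeated label selects a DataFrame, which only has `.dtypes`.
dup_is_frame = isinstance(df["a"], pd.DataFrame)
dup_has_dtype = hasattr(df["a"], "dtype")

# Duplicates in the index do not change what column selection returns.
df_dup_index = pd.DataFrame({"b": [1, 2]}, index=["x", "x"])
index_dup_is_series = isinstance(df_dup_index["b"], pd.Series)
```

So the assertion in the loop above holds for duplicated index labels but not for duplicated column labels.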
Having an object like this will probably give you problems later (like with IO: https://github.com/scverse/anndata/issues/884).
I'm a little split between making this work, or throwing an informative error here.
Thanks. I will avoid such cases in the future. I would prefer an error, plus calling var_names_make_unique() in the reading step.
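As a rough sketch of what the "informative error" option could look like: the helper name check_unique_columns and the message wording below are hypothetical, not part of anndata. It only assumes that the check runs on the obs or var DataFrame before any per-column dtype inspection.

```python
import pandas as pd


def check_unique_columns(df: pd.DataFrame, name: str) -> None:
    """Hypothetical helper: raise if `df` has repeated column labels."""
    dup_mask = df.columns.duplicated()
    if dup_mask.any():
        # Report each offending label once, in order of appearance.
        dups = list(df.columns[dup_mask].unique())
        raise ValueError(
            f"`{name}` has duplicated column names: {dups}. "
            "Rename or drop them before slicing."
        )


# Usage on a toy frame with a repeated "blobs" column.
bad = pd.DataFrame([[0, 1], [1, 2]], columns=["blobs", "blobs"])
try:
    check_unique_columns(bad, "obs")
    message = ""
except ValueError as err:
    message = str(err)
```

A check like this would surface the cause directly instead of the AttributeError raised from inside pandas.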
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
I see this as well! I accidentally had a repeated column in adata.var, and when I try to slice adata as in adata[logical_array], I get
AttributeError: 'DataFrame' object has no attribute 'dtype'
This also happens with duplicated .obs columns, which is a common mistake when updating adata.obs with concatenation.
import pandas as pd
import scanpy as sc

adata = sc.datasets.blobs()
new_col = adata.obs['blobs'].astype(int) + 1
adata.obs = pd.concat([adata.obs, new_col], axis=1)
adata[:, adata.var_names]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_189206/1060512404.py in ?()
1 new_col = adata.obs['blobs'].astype(int) + 1
2 adata.obs = pd.concat([adata.obs, new_col], axis=1)
3
----> 4 adata[:, adata.var_names]
/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, index)
1083 def __getitem__(self, index: Index) -> AnnData:
1084 """Returns a sliced view of the object."""
1085 oidx, vidx = self._normalize_indices(index)
-> 1086 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
265 ):
266 if asview:
267 if not isinstance(X, AnnData):
268 raise ValueError("`X` has to be an AnnData object.")
--> 269 self._init_as_view(X, oidx, vidx)
270 else:
271 self._init_as_actual(
272 X=X,
/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, adata_ref, oidx, vidx)
321 self._obsp = adata_ref.obsp._view(self, oidx)
322 self._varp = adata_ref.varp._view(self, vidx)
323 # fix categories
324 uns = copy(adata_ref._uns)
--> 325 self._remove_unused_categories(adata_ref.obs, obs_sub, uns)
326 self._remove_unused_categories(adata_ref.var, var_sub, uns)
327 # set attributes
328 self._obs = DataFrameView(obs_sub, view_args=(self, "obs"))
/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/anndata/_core/anndata.py in ?(self, df_full, df_sub, uns)
1088 def _remove_unused_categories(
1089 self, df_full: pd.DataFrame, df_sub: pd.DataFrame, uns: dict[str, Any]
1090 ):
1091 for k in df_full:
-> 1092 if not isinstance(df_full[k].dtype, pd.CategoricalDtype):
1093 continue
1094 all_categories = df_full[k].cat.categories
1095 with pd.option_context("mode.chained_assignment", None):
/oak/stanford/groups/pritch/users/emma/miniforge3/envs/perturb-vs-tissue-env/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
6295 and name not in self._accessors
6296 and self._info_axis._can_hold_identifiers_and_holds_name(name)
6297 ):
6298 return self[name]
-> 6299 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'dtype'
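Until this is handled inside anndata, one possible workaround is to avoid creating the duplicate label in the first place, or to drop repeated labels before assigning back to adata.obs. The sketch below uses a plain DataFrame as a stand-in for adata.obs, and the name blobs_plus_one is made up for the example.

```python
import pandas as pd

# Stand-in for adata.obs; the real object would come from an AnnData.
obs = pd.DataFrame({"blobs": ["0", "1", "2"]})
new_col = obs["blobs"].astype(int) + 1  # the Series is still named "blobs"

# Concatenating as-is produces two columns labelled "blobs".
combined = pd.concat([obs, new_col], axis=1)

# Option 1: rename the new column before concatenating.
renamed = pd.concat([obs, new_col.rename("blobs_plus_one")], axis=1)

# Option 2: keep only the first occurrence of each column label.
deduped = combined.loc[:, ~combined.columns.duplicated()]
```

Option 1 keeps both columns under distinct names, while option 2 silently discards the later one, so it is only appropriate when the repeated column is redundant.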