anndata
anndata copied to clipboard
AnnData concat: index_unique - rename?
Hi,
The experimental anndata.concat
function is really useful. I was wondering whether index_unique
as argument name is a bit misleading, however.
Given the name, I was expecting it to be a boolean, that switches on the use of keys
to make the index unique. I see in the documentation that it not only switches on these unique indeces, but also defines the separator - so the documentation is plenty clear.
Perhaps renaming the argument to, say, index_unique_sep
or something of the sort would be a more obvious hint as to the expected input type. Note that when passing a boolean the error is very non-obvious:
Error trace
UnboundLocalError Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 anndata.concat([bio,bio],keys=["bio","tech"],index_unique=True,join="inner")
File <...>/lib/python3.10/site-packages/anndata/_core/merge.py:847, in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
843 concat_indices = pd.concat(
844 [pd.Series(dim_indices(a, axis=axis)) for a in adatas], ignore_index=True
845 )
846 if index_unique is not None:
--> 847 concat_indices = concat_indices.str.cat(label_col.map(str), sep=index_unique)
848 concat_indices = pd.Index(concat_indices)
850 alt_indices = merge_indices(
851 [dim_indices(a, axis=alt_axis) for a in adatas], join=join
852 )
File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:125, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
120 msg = (
121 f"Cannot use .str.{func_name} with values of "
122 f"inferred dtype '{self._inferred_dtype}'."
123 )
124 raise TypeError(msg)
--> 125 return func(self, *args, **kwargs)
File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:633, in StringMethods.cat(self, others, sep, na_rep, join)
630 result = cat_safe(all_cols, sep)
631 else:
632 # no NaNs - can just concatenate
--> 633 result = cat_safe(all_cols, sep)
635 out: Index | Series
636 if isinstance(self._orig, ABCIndex):
637 # add dtype for case that result is all-NA
File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:3184, in cat_safe(list_of_columns, sep)
3178 if dtype not in ["string", "empty"]:
3179 raise TypeError(
3180 "Concatenation requires list-likes containing only "
3181 "strings (or missing values). Offending values found in "
3182 f"column {dtype}"
3183 ) from None
-> 3184 return result
UnboundLocalError: local variable 'result' referenced before assignment
Best, Jesko
See also: #629
I agree it's not a perfect name. It's a little difficult to convey that passing it is both a flag and the value. Do you think a better error would be enough? That way we could skip having a deprecation and rename.
Sorry, completely missed the PR. I agree with your comments there but am still unsure. Pro renaming is that it is still an experimental feature; con that it seems to improve the situation only slightly. I think a better error would help users decipher the problem faster, although ideally one can get it right the first time. So it probably boils down to how many people are already using concat
and would be impacted by the change.
PS sorry for the malformatted error trace earlier, fixed it now
After sleeping on it, personally and for the sake of future proofing I would lean towards deprecation, but this is a lot easier to say when not being in the maintainer role.
Then again, it would break some open-source projects, like:
To name a few (see here for more). Most of these do not have a lot of stars (assuming they are at least somewhat representative of usage). But of course some of them are accompanied by papers and these changes would hamper their reproducibility.