anndata icon indicating copy to clipboard operation
anndata copied to clipboard

AnnData concat: index_unique - rename?

Open jeskowagner opened this issue 2 years ago • 3 comments

Hi,

The experimental anndata.concat function is really useful. I was wondering whether index_unique as argument name is a bit misleading, however.

Given the name, I was expecting it to be a boolean, that switches on the use of keys to make the index unique. I see in the documentation that it not only switches on these unique indeces, but also defines the separator - so the documentation is plenty clear.

Perhaps renaming the argument to, say, index_unique_sep or something of the sort would be a more obvious hint as to the expected input type. Note that when passing a boolean the error is very non-obvious:

Error trace
UnboundLocalError                         Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 anndata.concat([bio,bio],keys=["bio","tech"],index_unique=True,join="inner")

File <...>/lib/python3.10/site-packages/anndata/_core/merge.py:847, in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    843 concat_indices = pd.concat(
    844     [pd.Series(dim_indices(a, axis=axis)) for a in adatas], ignore_index=True
    845 )
    846 if index_unique is not None:
--> 847     concat_indices = concat_indices.str.cat(label_col.map(str), sep=index_unique)
    848 concat_indices = pd.Index(concat_indices)
    850 alt_indices = merge_indices(
    851     [dim_indices(a, axis=alt_axis) for a in adatas], join=join
    852 )

File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:125, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    120     msg = (
    121         f"Cannot use .str.{func_name} with values of "
    122         f"inferred dtype '{self._inferred_dtype}'."
    123     )
    124     raise TypeError(msg)
--> 125 return func(self, *args, **kwargs)

File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:633, in StringMethods.cat(self, others, sep, na_rep, join)
    630     result = cat_safe(all_cols, sep)
    631 else:
    632     # no NaNs - can just concatenate
--> 633     result = cat_safe(all_cols, sep)
    635 out: Index | Series
    636 if isinstance(self._orig, ABCIndex):
    637     # add dtype for case that result is all-NA

File <...>/lib/python3.10/site-packages/pandas/core/strings/accessor.py:3184, in cat_safe(list_of_columns, sep)
   3178         if dtype not in ["string", "empty"]:
   3179             raise TypeError(
   3180                 "Concatenation requires list-likes containing only "
   3181                 "strings (or missing values). Offending values found in "
   3182                 f"column {dtype}"
   3183             ) from None
-> 3184 return result

UnboundLocalError: local variable 'result' referenced before assignment

Best, Jesko

jeskowagner avatar Sep 01 '22 09:09 jeskowagner

See also: #629

I agree it's not a perfect name. It's a little difficult to convey that passing it is both a flag and the value. Do you think a better error would be enough? That way we could skip having a deprecation and rename.

ivirshup avatar Sep 01 '22 14:09 ivirshup

Sorry, completely missed the PR. I agree with your comments there but am still unsure. Pro renaming is that it is still an experimental feature; con that it seems to improve the situation only slightly. I think a better error would help users decipher the problem faster, although ideally one can get it right the first time. So it probably boils down to how many people are already using concat and would be impacted by the change.

PS sorry for the malformatted error trace earlier, fixed it now

jeskowagner avatar Sep 01 '22 15:09 jeskowagner

After sleeping on it, personally and for the sake of future proofing I would lean towards deprecation, but this is a lot easier to say when not being in the maintainer role.

Then again, it would break some open-source projects, like:

To name a few (see here for more). Most of these do not have a lot of stars (assuming they are at least somewhat representative of usage). But of course some of them are accompanied by papers and these changes would hamper their reproducibility.

jeskowagner avatar Sep 02 '22 08:09 jeskowagner