scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

Is automatic sanitisation necessary?

Open lazappi opened this issue 3 years ago • 9 comments

In several Scanpy functions sanitize_anndata() is called which converts most string columns in obs/var to categorical columns. Often this seems to be a precaution rather than really necessary and as a user it's frustrating to have things changed in a way you have no control over. Just wondering if you have considered how required this is and whether it would be better to let users control when/if this happens.

lazappi avatar Mar 16 '21 17:03 lazappi

Initial justification was that it makes a number of computations much faster and doesn't seem to cause problems. If you have examples of problems being caused, that would be great to know.

I get this is sort of like stringsAsFactors, but I think a lot the problems with that is avoided by categorical having a more sane api than factors. For instance, you can just interact with categorical arrays of strings as though it was an array of strings.

ivirshup avatar Mar 17 '21 04:03 ivirshup

So admittedly this is mostly a philosophical objection, it's probably unlikely to cause any computational problems. The main practical issue I have is converting between Python and R where this causes the type of columns to change. I realise this is kinda a niche problem though and doesn't practically make a lot of difference.

My main two philosophically objects are:

  1. Users should have control over what type things are (unless this is required for computational reasons)
  2. Functions shouldn't have side effects (i.e. I expect the highly_variable_genes() function to calculate the highly variable genes, not do that AND modify a bunch of unrelated columns in obs/var)

I can see the potential computational benefits in some cases but I wonder if that is actually true for everywhere it is used. It seems mostly like a case of "this might potentially make a difference so let's do it just in case". Where it does matter it's probably possible to get the same benefit without making permanent changes to the object.

(Also I realise this is a pretty minor thing so not trying to make a big deal out of it. It annoys me but maybe not worth making major changes for 😝.)

lazappi avatar Mar 17 '21 08:03 lazappi

Hey! Just to chime in, I believe plotting functions also expect categoricals and I've had errors from other functions as well about obs columns not being categorical. I think that was rank_genes_groups, but I'm not sure.

LuckyMD avatar Mar 17 '21 09:03 LuckyMD

What I don't like about it is that it turns nan and None to the strings "nan" and "None", respectively. This makes finding "nan" values in a column of adata.obs rather cumbersome.

But I believe this particular detail is also discussed in https://github.com/theislab/anndata/issues/141.

grst avatar Mar 17 '21 15:03 grst

@lazappi

Users should have control over what type things are (unless this is required for computational reasons)

I'm pretty sure I took your position back when this behaviour was added. The closest issue I could find was https://github.com/theislab/anndata/issues/115, but I remember having a longer discussion with @flying-sheep about this.

However, I'm now pretty convinced that strings as categoricals is pretty necessary for computational reasons. There isn't a good "array of strings" in python, so all operations on those kinds are reaaaaally slow. Also, most of the time strings really are encoding a categorical variable. If they aren't, they should be unique (so we don't convert).

Functions shouldn't have side effects (i.e. I expect the highly_variable_genes() function to calculate the highly variable genes, not do that AND modify a bunch of unrelated columns in obs/var)

I would generally agree with this. I don't like that highly_variable_genes will add mean and variance measures to the object. I guess I don't see converting strings to categoricals as being a big change, since most operations on them will have identical results.

I do see how this would cause problems with R since this isn't the case with factors. I wonder if there this could be solved in the converter? Maybe there is a better implementation of categorical values out there in the R ecosystem? Does stringr play nicer with factors?


@grst, that's bug, we definitely allow null values in categoricals, but it looks like they're being lost in conversion from strings.

ivirshup avatar Mar 18 '21 03:03 ivirshup

Hey! Just to chime in, I believe plotting functions also expect categoricals and I've had errors from other functions as well about obs columns not being categorical. I think that was rank_genes_groups, but I'm not sure.

This is definitely true but easy enough for the user to address if the error is clear (or handle internally as needed).

lazappi avatar Mar 18 '21 08:03 lazappi

I'm pretty sure I took your position back when this behaviour was added. The closest issue I could find was theislab/anndata#115, but I remember having a longer discussion with @flying-sheep about this.

However, I'm now pretty convinced that strings as categoricals is pretty necessary for computational reasons. There isn't a good "array of strings" in python, so all operations on those kinds are reaaaaally slow. Also, most of the time strings really are encoding a categorical variable. If they aren't, they should be unique (so we don't convert).

I would still argue it's better to complain and get the user to fix it or convert to a categorical internally for the purposes of the function. It's not the conversion that I find the issue it's that there is no way to control it. I would be totally fine with including sanitize_andata() in examples, just as a function called by the user rather than internally.

An example of a non-unique non-categorical variable would be alternative gene annotations. It's relatively common for more than one ENSEMBL idea to map to a gene symbol but that's not really a "category" that's useful for anything. Maybe the threshold of one repeated value is too high and it should be like 10%? That would probably become too unpredictable though.

I would generally agree with this. I don't like that highly_variable_genes will add mean and variance measures to the object. I guess I don't see converting strings to categoricals as being a big change, since most operations on them will have identical results.

I wouldn't say the means/variances are side effects, more of an intermediate value. That's what the function says it's calculating and they are useful to have for later. Changing unrelated columns is what bugs me.

I do see how this would cause problems with R since this isn't the case with factors. I wonder if there this could be solved in the converter?

I don't think there is any good way for a converter to know which columns are intended to be categorical/factors and which have just ended up that way after running Scanpy functions on the object. Getting factors back doesn't really make a practical difference (and there are advantages to storing things like that), it's just that it happens automatically with no way to control it I don't like.

Maybe there is a better implementation of categorical values out there in the R ecosystem? Does stringr play nicer with factors?

Not sure if there are alternatives to factors but I probably wouldn't use them even if there are. Users (and possibly functions) would be confused by non-standard types. Plus it assumes there is something inherently wrong with R factors which is another philosophical issue 😉. Also that wouldn't address the automatic conversion problem.

lazappi avatar Mar 18 '21 08:03 lazappi

To me it seems a bit intransparent when strings get converted to categoricals. Essentially, if I retrieve a certain column from anndata, I can't rely on it being string or categorical. It could be either and it may even change during the analysis.

If the columns were converted to categoricals the moment they are added to .obs, it would be a lot better IMO.

grst avatar Apr 28 '22 06:04 grst

Suggestion: https://github.com/scverse/anndata/pull/1474

falexwolf avatar Apr 22 '24 11:04 falexwolf