scanpy Improve documentation for violin plot to illustrate per-gene usage

Hey! I thought plotting .var columns in sc.pl.violin() worked previously (and the error message seems to suggest this as well).

I am doing the following:

adata_counts.var['dropout_per_gene'] = (adata_counts.X > 0).mean(0)
adata_counts.obs['dropout_per_cell'] = (adata_counts.X > 0).mean(1)

sc.pl.violin(adata_counts, keys='dropout_per_gene')

and I get this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-463060c90a0b> in <module>()
----> 1 sc.pl.violin(adata_counts, keys='dropout_per_gene')

~/scanpy/scanpy/plotting/anndata.py in violin(adata, keys, groupby, log, use_raw, stripplot, jitter, size, scale, order, multi_panel, show, xlabel, rotation, save, ax, **kwds)
    630                 X_col = adata.raw[:, key].X
    631             else:
--> 632                 X_col = adata[:, key].X
    633             obs_df[key] = X_col
    634     if groupby is None:

~/anndata/anndata/base.py in __getitem__(self, index)
   1303     def __getitem__(self, index):
   1304         """Returns a sliced view of the object."""
-> 1305         return self._getitem_view(index)
   1306 
   1307     def _getitem_view(self, index):

~/anndata/anndata/base.py in _getitem_view(self, index)
   1306 
   1307     def _getitem_view(self, index):
-> 1308         oidx, vidx = self._normalize_indices(index)
   1309         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1310 

~/anndata/anndata/base.py in _normalize_indices(self, index)
   1283         obs, var = super(AnnData, self)._unpack_index(index)
   1284         obs = _normalize_index(obs, self.obs_names)
-> 1285         var = _normalize_index(var, self.var_names)
   1286         return obs, var
   1287 

~/anndata/anndata/base.py in _normalize_index(index, names)
    261         return slice(start, stop, step)
    262     elif isinstance(index, (int, str)):
--> 263         return name_idx(index)
    264     elif isinstance(index, (Sequence, np.ndarray, pd.Index)):
    265         # here, we replaced the implementation based on name_idx with this

~/anndata/anndata/base.py in name_idx(i)
    248                 raise IndexError(
    249                     'Key "{}" is not valid observation/variable name/index.'
--> 250                     .format(i))
    251             i = i_found[0]
    252         return i

IndexError: Key "dropout_per_gene" is not valid observation/variable name/index.

The whole thing works for:

sc.pl.violin(adata_counts.T, keys='dropout_per_gene')
sc.pl.violin(adata_counts, keys='dropout_per_cell)

So it's clearly just not taking .var columns for sc.pl.violing().

I've also reproduced this with adata = sc.datasets.blob().

Nov 21 '18 15:11 LuckyMD

I just noticed this is not implemented. Maybe we could just mirror the sc.pl.scatter() setup for sc.pl.violin(). If this is a good idea, I could quickly implement this...

Nov 21 '18 15:11 LuckyMD

Not only violin but most plotting options do not consider what you have in .var. I this guarantees consistency in that what you see, for example in a violin plot, is always per obs. I think that taking the transpose, as you did, should be the right solution.

Nov 22 '18 12:11 fidelram

So you are saying this is a feature and not a bug 😄 . Does that mean you think one should not be able to plot .var covariates by default?

Nov 22 '18 13:11 LuckyMD

Of course! would be wild if the plotting would internally transpose the anndata object in case one of the provided keys exists in .var. sc.pl.violin(adata.T, 'key') is 100% the right thing to do.

I think the docs are a bit improvable though:

keys : str or list of str Keys for accessing variables of .var_names or fields of .obs.

The mention of var_names here means that you can select one or more genes to plot. How can we phrase that better? Maybe we should also add an example that uses transposing.

Nov 22 '18 15:11 flying-sheep

That's the internal structure that is in the sc.pl.scatter() function. Could just replicate that.

Nov 22 '18 16:11 LuckyMD

ouch, it’s pretty error prone to just guess! What if a column is in both .var and .obs? People will never figure out what they need to do in order to get what they want.

I don’t like replicating that or that it ever went into any function. Explicit is better than implicit.

We could throw a nice error if the column isn’t in .obs but is in .var instead, like

You specified column “dropout_per_gene” which is not in .obs, but in .var. Did you mean to call sc.pl.violin(adata.T, ...)?

Nov 23 '18 11:11 flying-sheep

@falexwolf and I discussed this at the time, and came to the conclusion we should be able to assume that the .var columns will not be ambiguously named. And as it defaults to .obs, it works the same as it would have done before anyway. I see the issue though, as scanpy functions do write to .obs and .var. On the one hand we would have to check that unambiguous naming is always upheld which can be difficult with scanpy growing, and on the other hand users may not be aware of what columns there are already in .obs when they name .var columns.

While I like the workaround with adata.T, it feels strange to tell people to transpose their data to overcome a technical restriction in the plotting function. Explicit is the better solution there.

Nov 23 '18 11:11 LuckyMD

you wanted to say “implicit” right?

and I disagree, transposing here is exactly the right thing to do: when you want to use your genes as observations in a plot you should transpose your AnnData so that they are the dimension made for observations!

also, there’s no “technical restriction”. it’s about API design; we could also introduce a switch_var_and_obs = False parameter, but i feel .T is simpler.

Nov 23 '18 13:11 flying-sheep

No, I meant an argument like over='cells' that would default to .obs covariates. So explicitly telling the function. Implicit would be nice, but you pointed out that it would require guessing, which has its own issues. I'm on the fence about your solution of doing neither and requiring the user to use adata.T.

I do see your rationale behind 'variables' and 'observations' though. I'm just not entirely sure that is clear to the user in the same way it is clear to the developer. As a user I see cells and genes in my dataset and may not be aware that one of them are treated as the variables that describe the other. Then the question is: do you want to be as user-friendly as possible (I'll call it 'the R way') or stick with consistent conventions that may not be clear to everyone ('the numpy way'?). Both can cause frustrations and both have benefits.

Nov 23 '18 14:11 LuckyMD

what if i told you that we can have our cake and eat it too? as said before:

We could throw a nice error if the column isn’t in .obs but is in .var instead, like

You specified column “dropout_per_gene” which is not in .obs, but in .var. Did you mean to call sc.pl.violin(adata.T, ...)?

Near zero frustration, because people can just do what the error tells them.

Nov 23 '18 15:11 flying-sheep

I like @flying-sheep's very last solution. To enable this for truly large-scale data and AnnData's that are backed on disk we need a much more efficient transposition implementation, which will probably need to return a view. That's problematic as it will break backwards compat (.T returns a copy these days). But it's good as it will allow adding fields to .var.

@LuckyMD: At the time, when you mentioned that you wanted to plot over genes in scatter, I was fine with with having the scatter wrapper and assuming no ambiguity in obs and var keys. Now, I'd advocate for @flying-sheep's solution. Of course, we'll maintain the feature in pl.scatter when refactoring its code (a lot of it became redundant after fidel introduced the completely rewritten scatter plots).

Nov 25 '18 21:11 falexwolf

Okay... so shall I make a PR for this? Or do you quickly want to change it @flying-sheep?

Nov 26 '18 09:11 LuckyMD

Sure, go ahead!

Nov 26 '18 10:11 flying-sheep

Improve documentation for violin plot to illustrate per-gene usage – violin(adata.T, 'gene_property')