Improve documentation for violin plot to illustrate per-gene usage – violin(adata.T, 'gene_property')
Hey! I thought plotting .var columns in sc.pl.violin() worked previously (and the error message seems to suggest this as well).
I am doing the following:
adata_counts.var['dropout_per_gene'] = (adata_counts.X > 0).mean(0)
adata_counts.obs['dropout_per_cell'] = (adata_counts.X > 0).mean(1)
sc.pl.violin(adata_counts, keys='dropout_per_gene')
and I get this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-8-463060c90a0b> in <module>()
----> 1 sc.pl.violin(adata_counts, keys='dropout_per_gene')
~/scanpy/scanpy/plotting/anndata.py in violin(adata, keys, groupby, log, use_raw, stripplot, jitter, size, scale, order, multi_panel, show, xlabel, rotation, save, ax, **kwds)
630 X_col = adata.raw[:, key].X
631 else:
--> 632 X_col = adata[:, key].X
633 obs_df[key] = X_col
634 if groupby is None:
~/anndata/anndata/base.py in __getitem__(self, index)
1303 def __getitem__(self, index):
1304 """Returns a sliced view of the object."""
-> 1305 return self._getitem_view(index)
1306
1307 def _getitem_view(self, index):
~/anndata/anndata/base.py in _getitem_view(self, index)
1306
1307 def _getitem_view(self, index):
-> 1308 oidx, vidx = self._normalize_indices(index)
1309 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1310
~/anndata/anndata/base.py in _normalize_indices(self, index)
1283 obs, var = super(AnnData, self)._unpack_index(index)
1284 obs = _normalize_index(obs, self.obs_names)
-> 1285 var = _normalize_index(var, self.var_names)
1286 return obs, var
1287
~/anndata/anndata/base.py in _normalize_index(index, names)
261 return slice(start, stop, step)
262 elif isinstance(index, (int, str)):
--> 263 return name_idx(index)
264 elif isinstance(index, (Sequence, np.ndarray, pd.Index)):
265 # here, we replaced the implementation based on name_idx with this
~/anndata/anndata/base.py in name_idx(i)
248 raise IndexError(
249 'Key "{}" is not valid observation/variable name/index.'
--> 250 .format(i))
251 i = i_found[0]
252 return i
IndexError: Key "dropout_per_gene" is not valid observation/variable name/index.
The whole thing works for:
sc.pl.violin(adata_counts.T, keys='dropout_per_gene')
sc.pl.violin(adata_counts, keys='dropout_per_cell)
So it's clearly just not taking .var columns for sc.pl.violing().
I've also reproduced this with adata = sc.datasets.blob().
I just noticed this is not implemented. Maybe we could just mirror the sc.pl.scatter() setup for sc.pl.violin(). If this is a good idea, I could quickly implement this...
Not only violin but most plotting options do not consider what you have in .var. I this guarantees consistency in that what you see, for example in a violin plot, is always per obs. I think that taking the transpose, as you did, should be the right solution.
So you are saying this is a feature and not a bug 😄 . Does that mean you think one should not be able to plot .var covariates by default?
Of course! would be wild if the plotting would internally transpose the anndata object in case one of the provided keys exists in .var. sc.pl.violin(adata.T, 'key') is 100% the right thing to do.
I think the docs are a bit improvable though:
keys : str or list of str Keys for accessing variables of .var_names or fields of .obs.
The mention of var_names here means that you can select one or more genes to plot. How can we phrase that better? Maybe we should also add an example that uses transposing.
That's the internal structure that is in the sc.pl.scatter() function. Could just replicate that.
ouch, it’s pretty error prone to just guess! What if a column is in both .var and .obs? People will never figure out what they need to do in order to get what they want.
I don’t like replicating that or that it ever went into any function. Explicit is better than implicit.
We could throw a nice error if the column isn’t in .obs but is in .var instead, like
You specified column “dropout_per_gene” which is not in
.obs, but in.var. Did you mean to callsc.pl.violin(adata.T, ...)?
@falexwolf and I discussed this at the time, and came to the conclusion we should be able to assume that the .var columns will not be ambiguously named. And as it defaults to .obs, it works the same as it would have done before anyway. I see the issue though, as scanpy functions do write to .obs and .var. On the one hand we would have to check that unambiguous naming is always upheld which can be difficult with scanpy growing, and on the other hand users may not be aware of what columns there are already in .obs when they name .var columns.
While I like the workaround with adata.T, it feels strange to tell people to transpose their data to overcome a technical restriction in the plotting function. Explicit is the better solution there.
you wanted to say “implicit” right?
and I disagree, transposing here is exactly the right thing to do: when you want to use your genes as observations in a plot you should transpose your AnnData so that they are the dimension made for observations!
also, there’s no “technical restriction”. it’s about API design; we could also introduce a switch_var_and_obs = False parameter, but i feel .T is simpler.
No, I meant an argument like over='cells' that would default to .obs covariates. So explicitly telling the function. Implicit would be nice, but you pointed out that it would require guessing, which has its own issues. I'm on the fence about your solution of doing neither and requiring the user to use adata.T.
I do see your rationale behind 'variables' and 'observations' though. I'm just not entirely sure that is clear to the user in the same way it is clear to the developer. As a user I see cells and genes in my dataset and may not be aware that one of them are treated as the variables that describe the other. Then the question is: do you want to be as user-friendly as possible (I'll call it 'the R way') or stick with consistent conventions that may not be clear to everyone ('the numpy way'?). Both can cause frustrations and both have benefits.
what if i told you that we can have our cake and eat it too? as said before:
We could throw a nice error if the column isn’t in
.obsbut is in.varinstead, likeYou specified column “dropout_per_gene” which is not in
.obs, but in.var. Did you mean to callsc.pl.violin(adata.T, ...)?
Near zero frustration, because people can just do what the error tells them.
I like @flying-sheep's very last solution. To enable this for truly large-scale data and AnnData's that are backed on disk we need a much more efficient transposition implementation, which will probably need to return a view. That's problematic as it will break backwards compat (.T returns a copy these days). But it's good as it will allow adding fields to .var.
@LuckyMD: At the time, when you mentioned that you wanted to plot over genes in scatter, I was fine with with having the scatter wrapper and assuming no ambiguity in obs and var keys. Now, I'd advocate for @flying-sheep's solution. Of course, we'll maintain the feature in pl.scatter when refactoring its code (a lot of it became redundant after fidel introduced the completely rewritten scatter plots).
Okay... so shall I make a PR for this? Or do you quickly want to change it @flying-sheep?
Sure, go ahead!