single-cell-best-practices
Changed to decoupler for pseudobulking (#141)
View / edit / reply to this conversation on ReviewNB
Zethson commented on 2023-02-03T14:13:21Z ----------------------------------------------------------------
Uhm, why are they all commented out?
alitinet commented on 2023-02-03T14:28:26Z ----------------------------------------------------------------
oh thanks, forgot to remove, we don't need it any more
View / edit / reply to this conversation on ReviewNB
Zethson commented on 2023-02-03T14:13:22Z ----------------------------------------------------------------
Line #2. adata_pb = dc.get_pseudobulk(adata, sample_col='sample', groups_col='cell_type', layer='counts', min_prop=0.2, min_smpls=3)
Quite a long line, I'd add a line break after every parameter.
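For illustration, the call quoted above with one parameter per line could look like the sketch below. The arguments are taken verbatim from the comment; the wrapper function `make_pseudobulk` is a hypothetical name added only so the snippet stands on its own, and newer decoupler releases may expose pseudobulking under a different function.

```python
def make_pseudobulk(adata):
    """Build pseudobulk profiles from an AnnData object with a 'counts' layer."""
    # Imported lazily so the sketch can be read without decoupler installed.
    import decoupler as dc

    # Same arguments as in the notebook, one per line for readability.
    return dc.get_pseudobulk(
        adata,
        sample_col="sample",
        groups_col="cell_type",
        layer="counts",
        min_prop=0.2,
        min_smpls=3,
    )
```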
View / edit / reply to this conversation on ReviewNB
Zethson commented on 2023-02-03T14:13:23Z ----------------------------------------------------------------
Line #1. sc.pp.normalize_total(adata_pb, target_sum=1e4)
What made you do this? I think most people normalize to one million?
alitinet commented on 2023-02-03T14:29:53Z ----------------------------------------------------------------
following decoupler tutorial here https://decoupler-py.readthedocs.io/en/latest/notebooks/pseudobulk.html
Zethson commented on 2023-02-03T14:31:05Z ----------------------------------------------------------------
I'd not change this without discussing it with Soroor
View / edit / reply to this conversation on ReviewNB
Zethson commented on 2023-02-03T14:13:24Z ----------------------------------------------------------------
The dimensions are now very very different.
Before: 16 x 15710
Now: 16 x 2435
Intended? Could you explain this, please?
alitinet commented on 2023-02-03T14:41:34Z ----------------------------------------------------------------
It comes from the min_prop=0.2 and min_smpls=3 params of get_pseudobulk (https://decoupler-py.readthedocs.io/en/latest/generated/decoupler.get_pseudobulk.html#decoupler.get_pseudobulk), which filter out genes that are expressed in <20% of all cells and genes that are expressed in <3 samples
alitinet commented on 2023-02-03T14:43:41Z ----------------------------------------------------------------
I'm not sure if it'd be better to make these more permissive
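To make the effect of the two thresholds concrete, here is a minimal pure-Python sketch of the filter as it is described in this thread (a gene is dropped if it is expressed in fewer than `min_prop` of all cells or in fewer than `min_smpls` samples). It is not decoupler's implementation, which may apply the proportion per sample; the function name `filter_genes` and the toy data are hypothetical.

```python
def filter_genes(cells, min_prop=0.2, min_smpls=3):
    """Return the genes that pass both thresholds.

    cells: list of (sample_id, {gene: count}) tuples, one entry per cell.
    """
    n_cells = len(cells)
    cells_expressing = {}    # gene -> number of cells with a non-zero count
    samples_expressing = {}  # gene -> set of samples with a non-zero count
    for sample, counts in cells:
        for gene, count in counts.items():
            if count > 0:
                cells_expressing[gene] = cells_expressing.get(gene, 0) + 1
                samples_expressing.setdefault(gene, set()).add(sample)
    return sorted(
        gene
        for gene in cells_expressing
        if cells_expressing[gene] / n_cells >= min_prop
        and len(samples_expressing[gene]) >= min_smpls
    )


# Toy data: 3 samples with 2 cells each (hypothetical values).
cells = [
    ("s1", {"A": 5, "B": 1, "C": 2}),
    ("s1", {"A": 3, "B": 0, "C": 4}),
    ("s2", {"A": 1, "B": 0, "C": 0}),
    ("s2", {"A": 2, "B": 0, "C": 0}),
    ("s3", {"A": 7, "B": 0, "C": 0}),
    ("s3", {"A": 1, "B": 0, "C": 0}),
]

strict = filter_genes(cells)                               # thresholds from the PR
permissive = filter_genes(cells, min_prop=0, min_smpls=0)  # no filtering
```

In this toy example only gene A survives the thresholds used in the PR: B is expressed in 1 of 6 cells, and C is expressed in a single sample. With both thresholds set to zero, all three genes are kept.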
View / edit / reply to this conversation on ReviewNB
Zethson commented on 2023-02-03T14:13:25Z ----------------------------------------------------------------
The new plot looks pretty different from the old one. What happened?
alitinet commented on 2023-02-03T14:41:49Z ----------------------------------------------------------------
Didn't notice, thanks! will fix
Hey @soroorh, we switched to decoupler for pseudobulk creation, which filters out genes that are expressed in <20% of all cells and genes that are expressed in <3 samples. After this step, we are left with 2435 of the original 15710 genes. Do you think it would make sense to make this filtering step more permissive, or is it ok as it is?
Also, for pb normalization, should we use 1e6 (as before) or 1e4 (from decoupler tutorial) as the normalizing factor?
Hey @alitinet. Since you are using filterByExpr from edgeR later in the workflow, I would retain as many genes as possible. So, I'd go with a more permissive filtering or no filtering. Also, perhaps explain explicitly in the chapter that anyone following edgeR's workflow for DE does not need to apply any filtering when making pseudobulks.
I would go with 1e6, as this is closer to how counts-per-million (CPM) is computed in edgeR.
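As a small sketch of the difference between the two scaling factors, the snippet below rescales one pseudobulk profile so that its counts sum to a chosen target. It only shows the arithmetic behind `target_sum`; it is neither scanpy's nor edgeR's implementation, and the function name and counts are hypothetical.

```python
def normalize_total(counts, target_sum):
    """Scale raw counts so that they sum to target_sum."""
    total = sum(counts)
    return [c * target_sum / total for c in counts]


pb_counts = [250, 750, 1000]                # hypothetical counts, library size 2000

cpm = normalize_total(pb_counts, 1e6)       # counts per million
per_10k = normalize_total(pb_counts, 1e4)   # factor used in the decoupler tutorial
```

The two results differ only by a constant factor of 100, so the choice mainly affects how closely the values match the CPM scale used in edgeR. In the notebook this would correspond to `sc.pp.normalize_total(adata_pb, target_sum=1e6)`.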
@alitinet this chapter should then also use the pertpy dataloader for the kang dataset!
https://pertpy.readthedocs.io/en/latest/usage/data/pertpy.data.kang_2018.html#pertpy.data.kang_2018
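A minimal sketch of what loading the dataset through the pertpy dataloader linked above might look like. The wrapper name `load_kang` is hypothetical, and the call is expected to fetch the data on first use, so it needs pertpy installed and network access.

```python
def load_kang():
    """Return the Kang 2018 dataset via pertpy's dataloader."""
    # Imported lazily so the sketch can be read without pertpy installed.
    import pertpy as pt

    return pt.data.kang_2018()
```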