
Liana on pseudobulk single-cell data

Open Marwansha opened this issue 1 year ago • 6 comments

Hey, I want to know whether it is best practice to always run liana on single-cell data to compute the ligand-receptor CCC results. In the differential expression tutorial, after differential expression was computed on pseudobulk data, liana was run on the adata object. Is that the best practice, and is it OK to run it on the pseudobulk data instead?

Thanks Marwan

Marwansha avatar Mar 28 '24 19:03 Marwansha

Hi @Marwansha,

I assume you are referring to the DE analysis vignette. From my knowledge, it is the current best practice to perform differential testing between single-cell samples at the pseudobulk level.

A couple of references on the topic:

https://www.nature.com/articles/s41576-023-00586-w
https://www.nature.com/articles/s41467-021-21038-1

I hope this helps.

dbdimitrov avatar Mar 29 '24 10:03 dbdimitrov

Hi @Marwansha,

Sorry, but I'm not sure I exactly follow. Can you elaborate on the sense in which I run liana on pseudobulks?

One can also say that average expression per cluster is a "pseudobulk" (which is how the vast majority of CCC methods approach it).

In the DE tutorial, you can think of li.mt.df_to_lr as a join of the DE stats with ligand-receptor prior knowledge. So it is not necessarily running LIANA+ in the standard sense.
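To make the "join" idea concrete, here is a minimal pure-Python sketch. It is a toy illustration only, not liana's implementation: the gene names, cell types, statistics, the helper `join_de_with_lr`, and the averaged `interaction_stat` column are all hypothetical and chosen for the example.

```python
# Toy illustration only: hypothetical names, not liana's actual implementation.

# Per-cell-type DE statistics, keyed by (cell type, gene).
de_stats = {
    ("Tcell", "LIG1"): 2.1,
    ("Tcell", "REC1"): -0.3,
    ("Bcell", "LIG1"): 0.2,
    ("Bcell", "REC1"): 1.7,
}

# Prior-knowledge ligand-receptor pairs (the second pair has no DE stats).
lr_pairs = [("LIG1", "REC1"), ("LIG2", "REC2")]


def join_de_with_lr(de_stats, lr_pairs):
    """Attach ligand and receptor DE stats to every source/target pair."""
    cell_types = sorted({cell_type for cell_type, _ in de_stats})
    rows = []
    for ligand, receptor in lr_pairs:
        for source in cell_types:
            for target in cell_types:
                lig = de_stats.get((source, ligand))
                rec = de_stats.get((target, receptor))
                if lig is None or rec is None:
                    continue  # inner join: drop pairs with missing stats
                rows.append({
                    "source": source,
                    "target": target,
                    "ligand": ligand,
                    "receptor": receptor,
                    "ligand_stat": lig,
                    "receptor_stat": rec,
                    # Assumed summary for illustration: mean of both stats.
                    "interaction_stat": (lig + rec) / 2,
                })
    return rows


rows = join_de_with_lr(de_stats, lr_pairs)
```

No ligand-receptor scoring method is run here; the DE statistics are simply looked up for each ligand in the source cell type and each receptor in the target cell type.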

dbdimitrov avatar Mar 29 '24 14:03 dbdimitrov

Sorry if I wasn't clear again.

My question is about generating the ligand-receptor interactions df. For example, for a single-cell data object: li.mt.rank_aggregate(adata, groupby='celltype', expr_prop=0.1, verbose=True)

What if I run rank_aggregate on the pseudobulk AnnData object (pdata, the one generated by decoupler) rather than on the single-cell object? li.mt.rank_aggregate(pdata, groupby='celltype', expr_prop=0.1, verbose=True) Here liana will treat each individual_celltype as 1 observation, so if we have 10 individuals, for 1 cell type we will have 10 observations per condition, while in single-cell data the observations are the cells of each cell type.

pdata = dc.get_pseudobulk(
    adata,
    groups_col="celltype",
    layer='counts',
    mode='sum',
    min_cells=10,
    min_counts=10000
)
pdata

I ran liana on the pseudobulk-aggregated AnnData object, and compared with the results from the single-cell object, the results make more sense for my data, as they are much less noisy. But I was not sure whether this has been tested before, or which one is the best practice.

Thanks

Marwansha avatar Mar 29 '24 14:03 Marwansha

Hi @Marwansha,

Sorry for the delay, I was away.

Hmm. This is a really interesting approach, though not standard. If you have normalized (total + log1p) the summed counts, I see nothing wrong with it.
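As a rough sketch of what "normalized (total + log1p) the summed counts" means, the toy example below sums raw counts per sample and cell type, scales each pseudobulk to a fixed total, and applies log1p. It uses only the standard library; the data, the helper names, and the `target_sum=1e4` scaling factor are assumptions for illustration, not values taken from this thread.

```python
import math

# Toy raw counts: one entry per cell, tagged with its sample and cell type.
cells = [
    {"sample": "s1", "celltype": "T", "counts": {"LIG1": 3, "REC1": 0, "GENE3": 7}},
    {"sample": "s1", "celltype": "T", "counts": {"LIG1": 1, "REC1": 2, "GENE3": 7}},
    {"sample": "s2", "celltype": "T", "counts": {"LIG1": 0, "REC1": 5, "GENE3": 5}},
]


def pseudobulk_sum(cells):
    """Sum raw counts over all cells sharing a (sample, cell type) pair."""
    profiles = {}
    for cell in cells:
        key = (cell["sample"], cell["celltype"])
        profile = profiles.setdefault(key, {})
        for gene, count in cell["counts"].items():
            profile[gene] = profile.get(gene, 0) + count
    return profiles


def normalize_log1p(profile, target_sum=1e4):
    """Scale a profile to target_sum total counts, then apply log1p."""
    total = sum(profile.values())
    if total == 0:
        return {gene: 0.0 for gene in profile}
    return {
        gene: math.log1p(count / total * target_sum)
        for gene, count in profile.items()
    }


pseudobulks = pseudobulk_sum(cells)
normalized = {key: normalize_log1p(p) for key, p in pseudobulks.items()}
```

Each normalized profile then plays the role that a single normalized cell plays in the standard workflow.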

It only changes the interpretation a bit, since instead of comparing means across cells, you are comparing means across sample pseudobulks.

Just to share my intuition with this, think of CellPhoneDB. You get a mean between the averaged ligand and receptor expression per cluster (lr_mean), and you get a p-value where the averaging is done on permuted cell labels (cpdb_pvals). In your case, I believe the lr_mean ranking shouldn't change too much whether you do it on pseudobulks or at the single-cell level. However, the p-values should be quite different (since you are shuffling cell type pseudobulks per sample) and would likely be a bit more conservative.
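The intuition above can be sketched with a simplified, CellPhoneDB-style score in plain Python. This is a toy under stated assumptions, not the code liana or CellPhoneDB runs: the function names, the data, and the number of permutations are hypothetical. Each observation can be read either as a cell or as a sample-level pseudobulk.

```python
import random


def mean(values):
    return sum(values) / len(values)


def lr_mean(obs, labels, source, target, ligand, receptor):
    """Average of mean ligand expression in source and mean receptor in target."""
    lig = mean([o[ligand] for o, lab in zip(obs, labels) if lab == source])
    rec = mean([o[receptor] for o, lab in zip(obs, labels) if lab == target])
    return (lig + rec) / 2


def permutation_pvalue(obs, labels, source, target, ligand, receptor,
                       n_perms=1000, seed=0):
    """Fraction of label shuffles whose lr_mean is at least the observed one."""
    rng = random.Random(seed)
    observed = lr_mean(obs, labels, source, target, ligand, receptor)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(shuffled)  # group sizes are preserved by shuffling
        if lr_mean(obs, shuffled, source, target, ligand, receptor) >= observed:
            hits += 1
    return observed, hits / n_perms


# Six observations: three per cell type (e.g. three sample pseudobulks each).
obs = [
    {"LIG1": 5.0, "REC1": 0.1},
    {"LIG1": 6.0, "REC1": 0.2},
    {"LIG1": 4.0, "REC1": 0.3},
    {"LIG1": 0.1, "REC1": 3.0},
    {"LIG1": 0.2, "REC1": 4.0},
    {"LIG1": 0.3, "REC1": 5.0},
]
labels = ["A", "A", "A", "B", "B", "B"]

observed, pval = permutation_pvalue(obs, labels, "A", "B", "LIG1", "REC1")
```

With only a handful of observations per cell type, the number of distinct label arrangements is small (20 here), which bounds how small the permutation p-value can get. With thousands of cells as observations, far more arrangements exist, so much smaller p-values are attainable.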

In short, at a glance, I like it as an idea, and it can make sense depending on your data. You are also avoiding over-inflated permuted p-values due to pseudoreplication. :)

dbdimitrov avatar Apr 04 '24 05:04 dbdimitrov

PS. A major motivation of mine when writing liana-py was to make it flexible, so I'm glad to see when it's used in ways beyond the tutorials.

dbdimitrov avatar Apr 04 '24 05:04 dbdimitrov

Thank you very much for your response. In fact, I am trying to benchmark and compare the different results that come from computing the CCC (cell-cell communication) on single-cell or pseudobulk objects. From my perspective, and from a ground truth point of view (considering some ligand-receptor interactions that exist in one group and not in the other, which I know from literature and previous work), it seems that using the pseudobulk data makes it cleaner and easier to discern.

Would you be interested in having a short meeting? Maybe I can show you my data (I can share it too), so I can get some insights from your point of view on which approach makes more sense.

Thanks, Marwan. My email, in case: [email protected]

Marwansha avatar Apr 04 '24 09:04 Marwansha