scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

filter_rank_genes_groups fold change calculation

Open coh-racng opened this issue 4 years ago • 9 comments

filter_rank_genes_groups handles log values differently than the rank_genes_groups function. The discrepancy gives different DE gene lists when filtering genes based on log2fc_min=1 with get.rank_genes_groups_df vs min_fold_change=2 with tl.filter_rank_genes_groups

In rank_genes_groups, np.expm1 is used:

foldchanges = (np.expm1(means[imask]) + 1e-9) / (np.expm1(mean_rest) + 1e-9)
rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))

In filter_rank_genes_groups, np.exp is used:

if log:
    fold_change_matrix.loc[:, cluster] = (np.exp(mean_obs.loc[True]) / np.exp(mean_obs.loc[False])).values

coh-racng avatar Oct 07 '19 18:10 coh-racng

grrrrrrr

gianasco avatar Dec 02 '19 11:12 gianasco

and may be also this??
abs(fold_change_matrix) > min_fold_change

there is also no need to change the help in https://icb-scanpy.readthedocs-hosted.com/en/stable/api/scanpy.tl.filter_rank_genes_groups.html

gianasco avatar Dec 02 '19 11:12 gianasco

I want to second this issue!! I just spent many hours digging into the source code to figure out why filter_rank_genes_groups was filtering out genes that reported really high fold changes from rank_genes_groups, only to discover the discrepancy in the fold change calculation.

Here is an example of how confusing this inconsistency can be:

  • I run rank_genes_groups and see that many marker genes have high log2 fold changes in adata.uns['rank_genes_groups']['logfoldchanges'][<cluster_string>]. For example, gene X has a fold change of -27.720167.
  • Then, I run filter_rank_genes_groups -- and none of these genes with high negative fold changes are retained
  • There are two issues here: one is that negative fold changes don't get retained at all. [This is the issue I notice first, and report in #1325]. I fix that in my fork of the repo (solution below), but STILL these genes are removed when filtering for a min absolute fold change of 1.5 (0.58 on log scale)... ?!
  • This boils down to the inconsistency in fold change calculation. Mean expression of gene X within my cluster of interest is 0, and outside it is 0.1997576. np.log2((0 + 1e-9)/(0.1997576 + 1e-9)) = -27.720167, as reported originally by rank_genes_groups. As a user, I completely expect this gene to pass my threshold. filter_rank_genes_groups, however, calculates fold change as np.log2(np.exp(0)/np.exp(0.199758)) = -0.288189, which does NOT pass my fold change threshold, thus it gets filtered out. All this happens silently of course [the only number I have seen is a whopping fold change of -27] leaving me utterly confused.

I'm not sure which is more correct (though -27 seems pretty inflated to me given the raw numbers), but it would make a lot more sense for it to at least be consistent, especially so that filter_rank_genes_groups could give expected results.

p.s. Here is my fix to retain downregulated genes in filter_rank_genes_groups: update the third condition to (np.absolute(np.log2(fold_change_matrix)) > np.log2(min_fold_change)) (similar to @gianasco's suggestion, but handles downregulated fold changes more appropriately). I noted this issue separately in #1325

rpeys avatar Jul 21 '20 00:07 rpeys

@Koncopd, since you're already looking at the DE code, would you mind taking a look at this? We should definitely be consistent about how we calculate this.

ivirshup avatar Jul 31 '20 07:07 ivirshup

@ivirshup yes, i'll check.

Koncopd avatar Jul 31 '20 08:07 Koncopd

Has this been solved? By the way I see some cases where filtering doesn't work. I still see a lot of genes not passing the fold threshold surviving. And depleted genes are also there

brianpenghe avatar Sep 22 '20 14:09 brianpenghe

Has this been solved?

jefferyUstc avatar Oct 12 '20 12:10 jefferyUstc

For all those asking whether this has been updated, I'm not a contributor to this repo but from reading the source code, it looks like this has indeed been updated in the latest version of scanpy. From what I can tell,

  • np.log2((expm1_func(mean_in_cluster) + 1e-9) / (expm1_func(mean_out_cluster) + 1e-9)) is now consistently used to calculate and filter fold changes, in rank_genes_groups and filter_rank_genes_groups respectively.
  • The default value for arg min_fold_change=2 has also been changed from 2 to 1, which makes sense. i.e. won't filter out genes based on fold change unless the user explicitly asks it to
  • Still no fix for the issue where negative fold changes get automatically filtered out in filter_rank_genes_groups

rpeys avatar Feb 10 '21 05:02 rpeys

Looks like this has been fixed. There is an option now to compare absolute fold changes if you set compare_abs=True.

racng avatar Jun 08 '22 20:06 racng