scanpy
scanpy copied to clipboard
filter_rank_genes_groups fold change calculation
filter_rank_genes_groups
handles log values differently than the rank_genes_groups
function. The discrepancy gives different DE gene lists when filtering genes based on log2fc_min=1
with get.rank_genes_groups_df
vs min_fold_change=2
with tl.filter_rank_genes_groups
In rank_genes_groups, np.expm1
is used:
foldchanges = (np.expm1(means[imask]) + 1e-9) / (np.expm1(mean_rest) + 1e-9)
rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
In filter_rank_genes_groups, np.exp
is used:
if log:
fold_change_matrix.loc[:, cluster] = (np.exp(mean_obs.loc[True]) / np.exp(mean_obs.loc[False])).values
grrrrrrr
and may be also this??
abs(fold_change_matrix) > min_fold_change
there is also no need to change the help in https://icb-scanpy.readthedocs-hosted.com/en/stable/api/scanpy.tl.filter_rank_genes_groups.html
I want to second this issue!! I just spent many hours digging into the source code to figure out why filter_rank_genes_groups
was filtering out genes that reported really high fold changes from rank_genes_groups
, only to discover the discrepancy in the fold change calculation.
Here is an example of how confusing this inconsistency can be:
- I run
rank_genes_groups
and see that many marker genes have high log2 fold changes inadata.uns['rank_genes_groups']['logfoldchanges'][<cluster_string>]
. For example, gene X has a fold change of -27.720167. - Then, I run
filter_rank_genes_groups
-- and none of these genes with high negative fold changes are retained - There are two issues here: one is that negative fold changes don't get retained at all. [This is the issue I notice first, and report in #1325]. I fix that in my fork of the repo (solution below), but STILL these genes are removed when filtering for a min absolute fold change of 1.5 (0.58 on log scale)... ?!
- This boils down to the inconsistency in fold change calculation. Mean expression of gene X within my cluster of interest is 0, and outside it is 0.1997576.
np.log2((0 + 1e-9)/(0.1997576 + 1e-9)) = -27.720167
, as reported originally byrank_genes_groups
. As a user, I completely expect this gene to pass my threshold.filter_rank_genes_groups
, however, calculates fold change asnp.log2(np.exp(0)/np.exp(0.199758)) = -0.288189
, which does NOT pass my fold change threshold, thus it gets filtered out. All this happens silently of course [the only number I have seen is a whopping fold change of -27] leaving me utterly confused.
I'm not sure which is more correct (though -27 seems pretty inflated to me given the raw numbers), but it would make a lot more sense for it to at least be consistent, especially so that filter_rank_genes_groups
could give expected results.
p.s. Here is my fix to retain downregulated genes in filter_rank_genes_groups
: update the third condition to (np.absolute(np.log2(fold_change_matrix)) > np.log2(min_fold_change))
(similar to @gianasco's suggestion, but handles downregulated fold changes more appropriately). I noted this issue separately in #1325
@Koncopd, since you're already looking at the DE code, would you mind taking a look at this? We should definitely be consistent about how we calculate this.
@ivirshup yes, i'll check.
Has this been solved? By the way I see some cases where filtering doesn't work. I still see a lot of genes not passing the fold threshold surviving. And depleted genes are also there
Has this been solved?
For all those asking whether this has been updated, I'm not a contributor to this repo but from reading the source code, it looks like this has indeed been updated in the latest version of scanpy. From what I can tell,
-
np.log2((expm1_func(mean_in_cluster) + 1e-9) / (expm1_func(mean_out_cluster) + 1e-9))
is now consistently used to calculate and filter fold changes, inrank_genes_groups
andfilter_rank_genes_groups
respectively. - The default value for arg
min_fold_change=2
has also been changed from 2 to 1, which makes sense. i.e. won't filter out genes based on fold change unless the user explicitly asks it to - Still no fix for the issue where negative fold changes get automatically filtered out in
filter_rank_genes_groups
Looks like this has been fixed. There is an option now to compare absolute fold changes if you set compare_abs=True
.