GSVA icon indicating copy to clipboard operation
GSVA copied to clipboard

Different results after package update

Open rominaappierdo opened this issue 7 months ago • 5 comments

Hi GSVA team,

First of all, thank you for maintaining and developing this great package.

I recently updated the gsva package and noticed that the results I obtain using the same input data and parameters now differ from those generated with the previous version. It seems that there has been an update to the way the ssgsea method is called and possibly to the underlying implementation.

Could you please clarify if there were changes in the algorithm or default parameters that would explain this behavior? Also, is there a way to reproduce the previous results using the new version of the package — perhaps by explicitly setting some parameters or using a legacy mode (if available)?

Thank you in advance for your help!

rominaappierdo avatar Jun 17 '25 10:06 rominaappierdo

Hi Romina, thanks for reaching out, during the last two years the ssGSEA implementation has been optimized for speed and memory consumption, but results should be the same. In fact, we routinely run so-called "regression" tests to make sure that results don't change as a result of altering the code. To track down what the problem can be, I would need you to provide a minimal reproducible example, including the corresponding session information.

rcastelo avatar Jun 17 '25 12:06 rcastelo

Hi Robert,

Thank you so much for your fast reply. Unfortunately, I'm not able to reproduce the example with the old version of the package, as it's no longer available. I might be able to dig up an old result table to try and compare, but I’m not sure yet. In the meantime, I can share the code I was using.

This was the old syntax:

hallmark_df <- gsva(
  as.matrix(expr),
  gset.idx.list = hallmark,
  min.sz = 5,
  parallel.sz = ncore,
  method = 'ssgsea',
  ssgsea.norm = FALSE,
  verbose = FALSE
)

And this is the new syntax I’m using now:

param <- ssgseaParam(
  exprData = as.matrix(expr),
  geneSets = hallmark,
  normalize = FALSE,
  minSize = 5
)
hallmark_df <- gsva(param = param)

Shouldn’t this give equivalent results? Let me know if there’s anything I might be missing in the translation.

Thanks again!

rominaappierdo avatar Jun 17 '25 15:06 rominaappierdo

Hi Romina, yes, it should give identical results, and it does. We introduced the new parameter-object based API in the release version of GSVA 1.50 (Bioconductor 3.18, see this link for examining when Bioconductor releases occur) in October 2023 (see GSVA NEWS file, starting then the deprecation process of the old one, i.e., the old one was in October 2023 still working. Running the R version available in spring 2023, which was R 4.3.1, allows me to run GSVA 1.48.3, which is the version before we introduced any such updates. I have run this two year old version of GSVA with the following code using some simulated data:

library(GSVA)

packageDescription("GSVA")$Version
[1] "1.48.3"
BiocManager::version()
[1] ‘3.17’

p <- 10 ## number of genes
n <- 30 ## number of samples

## simulate some toy gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
                 set2=paste("g", 4:6, sep=""),
                 set3=paste("g", 7:10, sep=""))

## simulate some toy expression data with a seed for reproducibility
set.seed(123)
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
            dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))

## run GSVA 1.48.3, Bioconductor release 3.17, spring 2023
ssgsea_es <- gsva(y, geneSets, method="ssgsea", ssgsea.norm=FALSE)

## save result to load it later
saveRDS(ssgsea_es, file="ssgsea_GSVA_1.48.3.rds")

Now run the current GSVA 2.2.0, Bioconductor release 3.21, spring 2025, load the previous results and compare:

library(GSVA)

packageDescription("GSVA")$Version                                                    
[1] "2.2.0"
BiocManager::version()                                                                
[1] ‘3.21’

p <- 10 ## number of genes
n <- 30 ## number of samples

## simulate some toy gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
                 set2=paste("g", 4:6, sep=""),
                 set3=paste("g", 7:10, sep=""))

## simulate some toy expression data with a seed for reproducibility
set.seed(123)
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
            dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))

## GSVA 2.2.0, Bioconductor release 3.21, spring 2025
param <- ssgseaParam(y, geneSets, normalize=FALSE)
ssgsea_es <- gsva(param)

## load results with GSVA 1.48.3
ssgsea_es_oldGSVA <- readRDS("ssgsea_GSVA_1.48.3.rds")

max(abs(ssgsea_es - ssgsea_es_oldGSVA))
[1] 0

attr(ssgsea_es, "geneSets") <- NULL
identical(ssgsea_es, ssgsea_es_oldGSVA)
[1] TRUE

As you can see the results between the two versions are completely identical. The only difference is not in the values of the enrichment scores, but in the fact that version 2.x of GSVA adds some metadata to the result in the form of attributes, but which once this metadata is dropped, even the identical() function in R says that both matrices are identical.

So, the difference that you are seeing in your results must originate somewhere else, you need to examine your whole pipeline and inputs that you give to GSVA.

rcastelo avatar Jun 18 '25 12:06 rcastelo

Thank you for your response.

I'm still a bit unsure about what's going on, but I noticed something that might be relevant. When I run ssGSEA, I get the following message:

Duplicated gene IDs removed from gene set KEGG_RIG_I_LIKE_RECEPTOR_SIGNALING_PATHWAY

Could this be part of the problem? Is it possible that the older version of the pipeline didn't remove duplicated genes, and this is causing the discrepancy I'm seeing?

Thanks again for your help!

rominaappierdo avatar Jun 18 '25 15:06 rominaappierdo

For me to help you figuring out exactly what is going on I would need to have the same input (expression data and gene sets) that you are using and the lines of code you are executing. You can share that by email to me at [email protected]. Without being able to reproduce what you are seeing, I can only speculate. I would expect that duplicated gene identifiers in a gene set should have a very little effect on the output, but as I said, it's just a guess.

rcastelo avatar Jun 18 '25 18:06 rcastelo

Hi Romina, thanks for your patience, somehow your email when to the spam folder, it took me a while to notice it, and then I've been traveling with little time to look at this, but I've finally done it and indeed the discrepancy comes from what you observed, that the older implementation of ssGSEA would use the gene sets directly as the user provides them, without removing duplicated genes, i.e., the user is entirely responsible for the input, while the newer implementation warns the user if duplicates are found, and removes them from the calculations. If you remove the duplicates doing this:

gene_sets <- lapply(gene_sets, unique)

before calling the older version of the package, you'll get the same results as with the newer version. For whatever reason, the gene sets you sent me have quite a few duplicated genes, this is just one gene set:

duplicated(gsets$KEGG_WNT_SIGNALING_PATHWAY)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

In fact, if you calculate the fraction of duplicated genes across gene set, there is even one with more than 85% of the genes duplicated:

summary(sapply(gsets, function(x) sum(duplicated(x))/length(x)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.03859 0.08244 0.13246 0.15051 0.86943 
which(fracdups > 0.85)
KEGG_GRAFT_VERSUS_HOST_DISEASE 
                            78

One or a few duplicated genes may not influence calculations much, but many may do. Obviously, one should expect that calculations without duplicated genes are more correct than with duplicated genes.

rcastelo avatar Jul 02 '25 17:07 rcastelo