GSVA Result of gsva is all NA

Dear authors, I am now using ssgsea method in gsva to evaluate the samples' enrichment score in a dataset, but the result is all NA, I would be very grateful if you could give me some guidance on that.

Belowing please kindly find the result: `> dim(expr_m) [1] 20502 10122

ssGSEA_matrix <- gsva(expr = expr_m, gset.idx.list = my_signature, method = 'ssgsea', kcdf = "Gaussian", min.sz > 1, abs.ranking = TRUE) Estimating ssGSEA scores for 14 gene sets. [1] "Calculating ranks..." [1] "Calculating absolute values from ranks..." |=========================================================================================================| 100%

[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 4180 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.

ssGSEA_matrix[1:5,1:5] TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J5-01 TCGA-OR-A5J6-01 Immune.enrichment.score NA NA NA NA NA Immune.cell.subsets NA NA NA NA NA Immune.signaling.molecules NA NA NA NA NA X13.T.cell.signature NA NA NA NA NA T.cells NA NA NA NA NA `

However, if I only use part of my dataset, I can get the results:

`> ssGSEA_matrix <- gsva(expr = expr_m[,1:100], gset.idx.list = my_signature, method = 'ssgsea', kcdf = "Gaussian", min.sz > 1, abs.ranking = TRUE) Estimating ssGSEA scores for 14 gene sets. [1] "Calculating ranks..." [1] "Calculating absolute values from ranks..." |=========================================================================================================| 100%

[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 571 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.

ssGSEA_matrix[1:5,1:5] TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J5-01 TCGA-OR-A5J6-01 Immune.enrichment.score 0.07036719 0.04767565 -0.001057149 0.00694892 0.13481281 Immune.cell.subsets -0.24078220 -0.22031222 -0.299557598 -0.29809871 -0.18673913 Immune.signaling.molecules -0.25096208 -0.19650415 -0.319315871 -0.30783535 -0.18980250 X13.T.cell.signature -0.06467369 -0.03327788 -0.166935307 -0.18018616 0.02333241 T.cells
`

May 07 '24 19:05 YihengJiang0912

Dear Yiheng Jiang,

the easiest way to solve your problem probably is to upgrade to the latest version, i.e. R-4.4.0, Bioconductor 3.19 and GSVA 1.52.0 (or later), brandnew and available since last week.

In that latest version, the old API for GSVA that you are currently using is defunct and you would have to change your code to use e.g.:

sp <- ssgseaParam(exprData=expr_m, geneSets=my_signature, minSize=10)
ssGSEA_matrix <- gsva(sp)

In addition, please note that

arguments kcdf and absRanking apply only to method GSVA and have no effect when using ssGSEA.
argument min.sz, now renamed to minSize, must be a numerical value specifying the minimum size for gene sets to use; in the example above I have set it to 10. (I realize that the sentence Consider setting 'min.sz > 1'., as used by the warning message, is misleading and we should change it.)

However, even going back to the latest version accepting your code ( R-4.3.3, Bioconductor 3.18, GSVA 1.50.5) I have tried and could not reproduce your problem using test data -- are there NA values in your data in expr_m[, 101:10122] ? What does e.g. any(is.na(expr_m)) say?

If upgrading doesn't solve your problem and it is not caused by NA values in your data, please let me know and do not forget to include the output of sessionInfo().

May 08 '24 15:05 axelklenk

Hi,

I may be having a problem that could be related. Our datasets in particular have different genes missing per-sample, so we cannot eliminate the set of genes that were not measured prior to running ssGSEA, we instead have to work with a sparse matrix. With the files that I've attached, running the following code

source("gene_sets.txt")
source("expr_mat.txt")
params <- GSVA::ssgseaParam(exprData = expr_mat, 
                            geneSets = gene_sets,
                            normalize = TRUE)
res <- GSVA::gsva(params)

results in a matrix that contains only NAs. I trace the issue to the normalization step here:

es <- es[, 1:n, drop=FALSE] / (range(es)[2] - range(es)[1])

Since if there are NAs in the input matrix, range(es) is NA NA, all values in the matrix after normalization become NA. We are currently using the following fix:

es <- es[, 1:n, drop=FALSE] / (range(es, na.rm = TRUE)[2] - range(es, na.rm = TRUE)[1])

Also, we remove NA values from the gene ranking before sending them to the .fastRndWalk function here:

    geneRanking <- order(R[, j], decreasing=TRUE, na.last = NA)
    geneSetsRankIdx <- lapply(match(geneSets, geneRanking), na.omit)

Do you think this makes sense to do?

Thank you!

expr_mat.txt gene_sets.txt session_info.txt

Jun 21 '24 20:06 bobzimmermann-D4C

Hi, thanks for attaching the data. Looking at it we see that actually most of the data are NA values, for instance:

dim(expr_mat)
[1]  668 1281
naspergene <- rowSums(is.na(expr_mat))
quantile(naspergene, prob=0.5)
   50% 
1246.5 
1247/ncol(expr_mat)
[1] 0.9734582

so, 50% of your rows (genes?) have NA values in 97% of the columns (cells?) . What kind of data is this? Is this single-cell data?

Jun 25 '24 13:06 rcastelo

Hi,

Thanks for looking into this. This is actually a subset of data from CPTAC. The data are taken from antibody probes, of which there are several per gene, which we aggregate to the gene-level by taking the mean. In fact we wouldn't use this data set in particular, but it served well in illustrating how this can be a problem.

Nevertheless, we can see that this can be a problem taking the setup that is illustrated in the vignette:

set.seed(42)
p <- 10000 ## number of genes
n <- 30    ## number of samples
X <- matrix(rnorm(p*n), nrow=p,
            dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
gs <- as.list(sample(10:100, size=100, replace=TRUE))
gs <- lapply(gs, \(n, p) paste0("g", sample(1:p, size=n)), p)
names(gs) <- paste0("gs", 1:length(gs))

And adding a more reasonable number of NAs:

na_rate <- 0.01
na_inds <- sample(seq_along(X), na_rate*length(X))
X_1pct <- X
X_1pct[na_inds] <- NA

Running ssGSEA with normalization results in a matrix with only NAs:

param <- GSVA::ssgseaParam(X_1pct, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE

In fact if we use only one NA, we get the same result:

X_1 <- X
X_1[gs[[1]][[1]],1] <- NA
param <- GSVA::ssgseaParam(X_1, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE

Without normalizing the results are different

param <- GSVA::ssgseaParam(X_1pct, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))/length(res)
[1] 0.4166667

param <- GSVA::ssgseaParam(X_1, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))
[1] 2

With the above proposed fixes, no NAs are generated in the results. Do you agree that this represents a good solution?

Thank you!

Jun 25 '24 14:06 bobzimmermann-D4C

Hi, so, this is some kind of proteomics data, right? My perception is that the amount of NAs is massive at least in the dataset you attached. Typically for proteomics data, one would discard rows (proteins?) and columns (samples?) with too many NAs, and attempt imputing values to make the data set complete.

The approach you are proposing consists of discarding NAs values altogether and we would be happy to provide some implementation for that approach, probably similar to what you propose, giving some warning message about the presence of NA values. The question here is whether the results make sense. Do you have some way to verify whether the results you obtain after removing NA values make sense?

Jun 25 '24 15:06 rcastelo

Hi,

Yes, the number of NAs there is greater than normal because I illustrated this with an aggregation of a subset of the data.

Below I illustrate a comparison of using the ssGSEA method (with the above fix) and mean aggregation to determine if the protein expression values are enriched in groups of patients who showed a positive response to a drug in a clinical trial. REACTOME pathways which show association with response with p <= 0.1 are shown on the x axis with mean aggregation (i.e. the mean of all genes in the gene set, dropping NAs) and ssGSEA on the y-axis. We use the signed log10 p-value to show that both the direction of effect and significance of the test are similar.

The given positively correlated pathways are in line with published data.

Let me know if you would like me to illustrate this in some other way. Thank you!

Best, Bob

Jun 25 '24 15:06 bobzimmermann-D4C

Hi, ok, I've opened a new issue, mentioned here above, to implement a missing data policy for ssGSEA. Thanks for the input. I'll let you know once this is part of the software. By now, I think it should be safe for you to use the patch you've implemented.

Jun 26 '24 17:06 rcastelo

Great! Looking forward.

Jun 26 '24 17:06 bobzimmermann-D4C

Hi @bobzimmermann-D4C we have implemented in the current devel branch of GSVA a first version of a missing data policy for ssGSEA. This is now available in devel version 1.53.4 through BiocManager::install() from a devel installation of Bioconductor; see https://contributions.bioconductor.org/use-devel.html. However, the current devel cycle runs on top of R 4.4.x, just as the release version of the packages, and this means that this devel version should not have problems with the underlying version of R. You could try installing it in your release version of Bioconductor by typing:

BiocManager::install("rcastelo/GSVA")

and if problems arise, then install back the release version of GSVA by doing:

BiocManager::install("GSVA", force=TRUE)

If you input proteomics data is stored in a matrix, ExpressionSet or a SummarizedExperiment object, just running:

library(GSVA)
param <- ssgseaParam(X, gs, use="na.rm")
res <- gsva(param)

should detect the presence of missing values and remove them from calculations. Please consult the parameters checkNA and use in the help page of ssgseaParam() for full details on how this is managed. I'd be very grateful if you can have a try with your data and let me know if this works as you expected. You may notice that the implementation is a bit more complex than what you proposed, this is to be able to consider different decisions that a user may want to take in the presence of missing data, and also to make dealing with missing data as performant as possible.

Jul 10 '24 07:07 rcastelo

Hi @rcastelo , thanks for this patch! We have tested it out on our test case and it appears to be working exactly as it should, but we appreciate the clarity of the interface you've provided. We will integrate this into our pipeline and let you know if anything else comes up.

Jul 15 '24 09:07 bobzimmermann-D4C

Hi @bobzimmermann-D4C I'm glad it's working as expected and thanks for pointing out the mistake in the code 🤦‍♂️ I've just corrected it 😅 one further question. My general understanding of mass-spec proteomic data analysis was that one typically imputes missing data values. I'd like to know why in this case you considered removing NAs a better solution than imputing, was it because of the extent of missing data being too large? could you point us out to a publication or data set where this is the case, in which is better to remove NAs rather than imputing them?

Jul 15 '24 09:07 rcastelo

Sure! This data is from CPTAC. From their publication available here, they write:

Finally, we focus on a challenge specific to PTMs. In the CPTAC data, we report quantitative measurement of phosphorylation and selected datasets also have data for acetylation and glycosylation. Although missing values are a regular part of all omics data, they are more pronounced in PTM data. One place where this is particularly problematic is pan-cancer analysis. If a PTM site is well quantified in one cancer type (e.g., EGFR tyrosine 1172), it may have many missing values in another, which would complicate a pan-cancer comparison of protein activation. One might be tempted to roll together all PTMs in a protein into a single measurement - e.g., the average phosphorylation state of EGFR. However, we advise against this, as PTMs at each site in a protein can be functionally independent and may not correlate across samples. Both experimental and computational approaches are being developed to improve PTM peptide identification, which will help alleviate the missing value problem in PTM proteomics

This doesn't specifically address the issue of pathway expression analysis, but in general for cancer genomics, it can be unwise to use imputation, and our policy has generally been to ignore rather than impute missing data. There may be different schools of thoughts on this, but especially with imbalanced cohorts, it can be especially critical not to try to regress to the mean.

Separately, I removed my comment about correcting the code above. 😊

Jul 15 '24 09:07 bobzimmermann-D4C

Excellent, thank you!! I'm closing the issue, but do not hesitate to open a new one if you encounter any problem or have any suggestion about the implementation of this missing data policy for ssGSEA.

Jul 15 '24 10:07 rcastelo