bigstatsr icon indicating copy to clipboard operation
bigstatsr copied to clipboard

Scaling predictor variables both in training and test datasets

Open annilk opened this issue 10 months ago • 4 comments

Hi,

I have run both genome- and metabolome-wide Lasso with big_spLinReg for a continuous trait. In both analyses, I have separated the training and test datasets to enable standardizing the phenotype and predictors in test data to the same scale as in the training data. For example, for metabolome-wide Lasso, I have used the following code:

# tr.dat includes columns for phenotype, covariates and metabolites and rows for training data individuals
# te.dat includes columns for phenotype, covariates and metabolites and rows for test data individuals
## ph.nm is name of the phenotype; covs is a vector for covariates; mtbs is a vector for metabolites
tr.dat = tr.dat %>% mutate_at(all_of(c(ph.nm,covs, mtbs)), list(~c(scale(.))))
te.dat = te.dat %>% mutate_at(all_of(c(ph.nm,covs, mtbs)), list(~c(scale(.))))

tr.X <- as_FBM(select(tr.dat, all_of(mtbs)))
te.X <- as_FBM(select(te.dat, all_of(mtbs)))

tr.y <- pull(tr.dat, all_of(ph.nm))
te.y <- pull(te.dat, all_of(ph.nm))

tr.covar.mat = covar_from_df(select(tr.dat, all_of(covs)))
te.covar.mat = covar_from_df(select(te.dat, all_of(covs)))

tr.N = length(tr.y)
te.N = length(te.y)

set.seed(1)

mod <- big_spLinReg(tr.X, tr.y, 
                    covar.train=as.matrix(tr.covar.mat), 
                    ind.train = 1:tr.N, K = 5, alphas = 1, power_scale = 1)

tr.pred <- predict(mod, tr.X, ind.row = 1:tr.N, covar.row = as.matrix(tr.covar.mat))
te.pred <- predict(mod, te.X, ind.row = 1:te.N, covar.row = as.matrix(te.covar.mat))

# functions to calculate R^2: 
mse <- function(x,y){mean((x-y)^2)}

r2 <- function(pred.y, true.y) {
  1 - mse(pred.y, true.y) / mse(true.y, mean(true.y))
}

# R^2 values:
tr.r2 = r2(tr.pred, tr.y)
te.r2 = r2(te.pred, te.y)

I have noticed that if I do not scale the metabolites, that is, if I replace the first two lines of code with

tr.dat = tr.dat %>% mutate_at(all_of(c(ph.nm, covs)), list(~c(scale(.))))
te.dat = te.dat %>% mutate_at(all_of(c(ph.nm, covs)), list(~c(scale(.))))

then the test R^2 te.r2 is a little lower (not much though). So standardizing the metabolites in the independent test data on the same scale as in training data improves model prediction in the independent test dataset.

Now I also noticed that for big_spLinReg there is the power_scale argument for which I have been using the default value 1. Should it then do essentially the same as I have done with the scale function to the training data? As far as I know, the implementation within scale is by default (i.e. when center = T, scale = T) something like (x - mean(x)) / sd(x) but based on the big_spLinReg documentation, the means are not subtracted from the respective predictors?

Since standardizing predictors in metabolome-wide Lasso improved model predictability in the test data set, I would like to do the same in genome-wide Lasso. However, I attempting to run the last two lines in the following code makes R session crash due to memory issues:

tr.rds <- bigsnpr::snp_readBed2(paste0(tr.genofile,".bed"), backingfile = tempfile()) # tr.genofile = name of the genotype data file incl. training samples
te.rds <- bigsnpr::snp_readBed2(paste0(te.genofile,".bed"), backingfile = tempfile()) # tr.genofile = name of the genotype data file incl. test samples

tr.snps <- snp_attach(tr.rds)
te.snps <- snp_attach(te.rds)
tr.G <- tr.snps$genotypes
te.G <- te.snps$genotypes

tr.G <- scale(tr.G[])
te.G <- scale(te.G[])

So is there some more memory-efficient way to do that? Or alternatively, if setting power_scale = 1 in big_spLinReg should do the job for training data, is there some similar argument in bigstatsr's predict function so that I would be able to standardize also predictors in the test data? So far, I have been able to run genome-wide Lasso only so that I skip the last two lines in the previous code chunk and then proceed as for the metabolome-wide Lasso, and R^2 resulting from the genome-wide Lasso for the test data is almost zero.

annilk avatar Apr 25 '24 12:04 annilk

  • big_spLinReg() may perform some scaling internally, but it will always return effect sizes corresponding to the input data (i.e. before any internal scaling).
  • I guess the way you compute r2 is wrong since it should be invariant of any scale of either pred and true; just use cor()^2 or pcor()^2 (possibly also keeping the sign).

privefl avatar Apr 25 '24 13:04 privefl

Thank you for the very fast response!

Do you mean that if I want my predictors to be on the same scale in test data than in training data, I should scale them externally? How to do that then in case of genotype data?

Thanks for pointing out the mistake in R2 calculation. However, I'm a little confused because I have been taught that when calculating R2, MSE should be used to ensure that predictions are on the correct scale of the outcome variable, and that using cor()^2 can be misleadingly optimistic. Can you explain why I should ignore the linear scale difference between the prediction and the outcome?

annilk avatar Apr 25 '24 13:04 annilk

I am not saying that you should not care about the prediction not being calibrated; I'm just saying that R2 does not measure calibration.

If you use the training data as is, which is then scale internally in big_spLinReg() but unscale to report the effects, then you should not have to scale the test data.

privefl avatar Apr 25 '24 20:04 privefl

Thank you again for your response. Can you also tell what do you mean by that "R2 does not measure calibration"?

annilk avatar May 10 '24 15:05 annilk

As I said, correlation is invariant from linear transformations, e.g.

x <- rnorm(100)
y <- rnorm(100)
cor(x, y)
x2 <- runif(1) * x + runif(1)
cor(x2, y)

So it cannot be used to measure calibration.

We are getting out-of-scope of the issues here.

privefl avatar May 11 '24 20:05 privefl