bigstatsr
bigstatsr copied to clipboard
Scaling predictor variables both in training and test datasets
Hi,
I have run both genome- and metabolome-wide Lasso with big_spLinReg
for a continuous trait. In both analyses, I have separated the training and test datasets to enable standardizing the phenotype and predictors in test data to the same scale as in the training data. For example, for metabolome-wide Lasso, I have used the following code:
# tr.dat includes columns for phenotype, covariates and metabolites and rows for training data individuals
# te.dat includes columns for phenotype, covariates and metabolites and rows for test data individuals
## ph.nm is name of the phenotype; covs is a vector for covariates; mtbs is a vector for metabolites
tr.dat = tr.dat %>% mutate_at(all_of(c(ph.nm,covs, mtbs)), list(~c(scale(.))))
te.dat = te.dat %>% mutate_at(all_of(c(ph.nm,covs, mtbs)), list(~c(scale(.))))
tr.X <- as_FBM(select(tr.dat, all_of(mtbs)))
te.X <- as_FBM(select(te.dat, all_of(mtbs)))
tr.y <- pull(tr.dat, all_of(ph.nm))
te.y <- pull(te.dat, all_of(ph.nm))
tr.covar.mat = covar_from_df(select(tr.dat, all_of(covs)))
te.covar.mat = covar_from_df(select(te.dat, all_of(covs)))
tr.N = length(tr.y)
te.N = length(te.y)
set.seed(1)
mod <- big_spLinReg(tr.X, tr.y,
covar.train=as.matrix(tr.covar.mat),
ind.train = 1:tr.N, K = 5, alphas = 1, power_scale = 1)
tr.pred <- predict(mod, tr.X, ind.row = 1:tr.N, covar.row = as.matrix(tr.covar.mat))
te.pred <- predict(mod, te.X, ind.row = 1:te.N, covar.row = as.matrix(te.covar.mat))
# functions to calculate R^2:
mse <- function(x,y){mean((x-y)^2)}
r2 <- function(pred.y, true.y) {
1 - mse(pred.y, true.y) / mse(true.y, mean(true.y))
}
# R^2 values:
tr.r2 = r2(tr.pred, tr.y)
te.r2 = r2(te.pred, te.y)
I have noticed that if I do not scale the metabolites, that is, if I replace the first two lines of code with
tr.dat = tr.dat %>% mutate_at(all_of(c(ph.nm, covs)), list(~c(scale(.))))
te.dat = te.dat %>% mutate_at(all_of(c(ph.nm, covs)), list(~c(scale(.))))
then the test R^2 te.r2
is a little lower (not much though). So standardizing the metabolites in the independent test data on the same scale as in training data improves model prediction in the independent test dataset.
Now I also noticed that for big_spLinReg
there is the power_scale
argument for which I have been using the default value 1. Should it then do essentially the same as I have done with the scale
function to the training data? As far as I know, the implementation within scale
is by default (i.e. when center = T, scale = T
) something like (x - mean(x)) / sd(x)
but based on the big_spLinReg documentation, the means are not subtracted from the respective predictors?
Since standardizing predictors in metabolome-wide Lasso improved model predictability in the test data set, I would like to do the same in genome-wide Lasso. However, I attempting to run the last two lines in the following code makes R session crash due to memory issues:
tr.rds <- bigsnpr::snp_readBed2(paste0(tr.genofile,".bed"), backingfile = tempfile()) # tr.genofile = name of the genotype data file incl. training samples
te.rds <- bigsnpr::snp_readBed2(paste0(te.genofile,".bed"), backingfile = tempfile()) # tr.genofile = name of the genotype data file incl. test samples
tr.snps <- snp_attach(tr.rds)
te.snps <- snp_attach(te.rds)
tr.G <- tr.snps$genotypes
te.G <- te.snps$genotypes
tr.G <- scale(tr.G[])
te.G <- scale(te.G[])
So is there some more memory-efficient way to do that? Or alternatively, if setting power_scale = 1
in big_spLinReg
should do the job for training data, is there some similar argument in bigstatsr's predict
function so that I would be able to standardize also predictors in the test data? So far, I have been able to run genome-wide Lasso only so that I skip the last two lines in the previous code chunk and then proceed as for the metabolome-wide Lasso, and R^2 resulting from the genome-wide Lasso for the test data is almost zero.
-
big_spLinReg()
may perform some scaling internally, but it will always return effect sizes corresponding to the input data (i.e. before any internal scaling). - I guess the way you compute r2 is wrong since it should be invariant of any scale of either
pred
andtrue
; just usecor()^2
orpcor()^2
(possibly also keeping the sign).
Thank you for the very fast response!
Do you mean that if I want my predictors to be on the same scale in test data than in training data, I should scale them externally? How to do that then in case of genotype data?
Thanks for pointing out the mistake in R2 calculation. However, I'm a little confused because I have been taught that when calculating R2, MSE should be used to ensure that predictions are on the correct scale of the outcome variable, and that using cor()^2
can be misleadingly optimistic. Can you explain why I should ignore the linear scale difference between the prediction and the outcome?
I am not saying that you should not care about the prediction not being calibrated; I'm just saying that R2 does not measure calibration.
If you use the training data as is, which is then scale internally in big_spLinReg()
but unscale to report the effects, then you should not have to scale the test data.
Thank you again for your response. Can you also tell what do you mean by that "R2 does not measure calibration"?
As I said, correlation is invariant from linear transformations, e.g.
x <- rnorm(100)
y <- rnorm(100)
cor(x, y)
x2 <- runif(1) * x + runif(1)
cor(x2, y)
So it cannot be used to measure calibration.
We are getting out-of-scope of the issues here.