ComBat-seq icon indicating copy to clipboard operation
ComBat-seq copied to clipboard

How to adjust by multiple variables using ComBat-Seq?

Open evaesquinas opened this issue 1 year ago • 3 comments

Dear @zhangyuqing ,

I am trying to adjust my RNA data using ComBat-Seq since I realised that there are 3 batches that I need to adjust for:

  • Place (2 levels: place 1, place 2)
  • Library Preparation Date (16 levels - different dates)
  • Type of tube (2 levels: A, B)

I have 960 samples and around 62000 genes.

In my biological matrix, I have: Age, Sex, Group (cases,controls..) and WBC counts.

biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + LYMPH + MONO + NEUT, data=phenotype)

In the tutorial of Combat-Seq appears how to adjust by 1 variable but it doesn't tell you how to adjust by more than 1.

I have seen a lot of posts using combat that the only way is to adjust by 1 variable and then, with those results, adjust again by the 2nd variable and so on.

That would be:

Adjust by library prep.

raw.cts_adjustedLibPrep <- ComBat_seq(counts = raw_cts_matrix, batch=batch_libraryprep, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube.

raw.cts_adjusted_LibPrep_TypeTube <- ComBat_seq(counts = raw.cts_adjustedLibPrep, batch=batch_type_tube, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube + place

 raw.cts_adjusted_LibPrep_TypeTube_Place <- ComBat_seq(counts = raw.cts_adjusted_LibPrep_TypeTube, batch=batch_place, group=NULL, covar_mod = biol_mat)

For the first adjustment (library prep) it takes around 15min, but for the second... it has been running for more than 2 days.. I stopped it and launch it again, changing the order of the adjustment but it is taking too much time and it seems that something is wrong...

Here I attach a similar example of my data:

Covariates (variables to adjust + biological information)

set.seed(11344)
covariates_df <- data.frame(
  Sample = paste0("A", seq(1,960)),
  Age = sample(seq(18, 70), size = 960, replace = TRUE),
  Group = sample(seq(1, 3), size = 960, replace = TRUE),
  Sex = sample(seq(1, 2), size = 960, replace = TRUE),
  WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
  Place = sample(seq(1, 2), size = 960, replace = TRUE),
  Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
  LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample

categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)


>str(covariates_df)

'data.frame':	960 obs. of  8 variables:
 $ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
 $ Age    : int  40 66 37 40 25 42 46 51 45 61 ...
 $ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
 $ Sex    : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
 $ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
 $ Place  : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
 $ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
 $ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...

The 3 variables that I want to adjust for will be: Place, Tubes and LibPrep. On the other hand, the biological information that I want to preserve would be Age, Group, Sex and WBC.

RNA raw counts

exp_df <- data.frame(replicate(960,sample(0:5216979,66000,rep=TRUE)))
colnames(exp_df) <- covariates_df$Sample
rownames(exp_df) <- paste0("Gene", rownames(exp_df))
exp_df_mat <- as.matrix(exp_df)

> str(exp_df)
'data.frame':	66000 obs. of  960 variables:
 $ A1  : int  3687756 4638393 4315502 316404 2209492 831342 4261323 1283200 1522808 1088673 ...
 $ A2  : int  4645779 3495763 4782397 2869628 2402288 1012125 1267979 1361720 1919367 4859438 ...
 $ A3  : int  2883976 4770076 228011 915940 19038 4650368 112632 1316246 1926498 484384 ...
 $ A4  : int  3496840 3676764 2763012 2723528 944224 3809955 5054054 1122139 116722 4090191 ...
 $ A5  : int  3140122 650422 2075888 2987814 1656462 2863317 155120 1086391 3163073 625809 ...
 $ A6  : int  4084796 1932277 3491059 4898410 4183070 4470479 4193882 928271 3282841 4418068 ...
 $ A7  : int  765797 3153554 5075853 4775395 3582194 4274642 1530455 1433179 4757168 2209519 ...
 $ A8  : int  1652346 3505656 111027 3170748 5087383 912180 2693545 1907366 3135340 548296 ...
 $ A9  : int  441053 5132021 1857530 2413007 1218852 3614836 4388581 106105 3270886 3840996 ...
 $ A10 : int  995597 2076013 1576689 1857073 1435772 3788330 3983860 816969 733090 1357226 ...
 $ A11 : int  4944929 5182067 4415296 3224862 145068 3533346 4174175 2406340 4572529 4820674 ...
 $ A12 : int  2731408 1896439 5063233 4621405 2686720 1507796 4764591 887296 2257893 384470 ...

The biological matrix would be: biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + WBC, data=covariates_df)

head(biol_mat)

(Intercept) Age as.factor(Sex)2 as.factor(Group)2 as.factor(Group)3  WBC
A1           1  40               0                 1                 0 12.8
A2           1  66               1                 1                 0 29.9
A3           1  37               1                 1                 0 25.7
A4           1  40               0                 0                 1  0.8
A5           1  25               0                 1                 0 30.8
A6           1  42               1                 1                 0  5.9

Could you kindly help me, please? I have tried all the possibilities and I do not really know if it is something wrong my code or it is because combat_seq cannot handle too much amount of data.

Thanks very much in advance Regards

evaesquinas avatar Jul 26 '22 14:07 evaesquinas