ComBat-seq
ComBat-seq copied to clipboard
How to adjust by multiple variables using ComBat-Seq?
Dear @zhangyuqing ,
I am trying to adjust my RNA data using ComBat-Seq since I realised that there are 3 batches that I need to adjust for:
- Place (2 levels: place 1, place 2)
- Library Preparation Date (16 levels - different dates)
- Type of tube (2 levels: A, B)
I have 960 samples and around 62000 genes.
In my biological matrix, I have: Age, Sex, Group (cases,controls..) and WBC counts.
biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + LYMPH + MONO + NEUT, data=phenotype)
In the tutorial of Combat-Seq
appears how to adjust by 1 variable but it doesn't tell you how to adjust by more than 1.
I have seen a lot of posts using combat
that the only way is to adjust by 1 variable and then, with those results, adjust again by the 2nd variable and so on.
That would be:
Adjust by library prep.
raw.cts_adjustedLibPrep <- ComBat_seq(counts = raw_cts_matrix, batch=batch_libraryprep, group=NULL, covar_mod = biol_mat)
Adjust by library prep + type of tube.
raw.cts_adjusted_LibPrep_TypeTube <- ComBat_seq(counts = raw.cts_adjustedLibPrep, batch=batch_type_tube, group=NULL, covar_mod = biol_mat)
Adjust by library prep + type of tube + place
raw.cts_adjusted_LibPrep_TypeTube_Place <- ComBat_seq(counts = raw.cts_adjusted_LibPrep_TypeTube, batch=batch_place, group=NULL, covar_mod = biol_mat)
For the first adjustment (library prep) it takes around 15min, but for the second... it has been running for more than 2 days.. I stopped it and launch it again, changing the order of the adjustment but it is taking too much time and it seems that something is wrong...
Here I attach a similar example of my data:
Covariates (variables to adjust + biological information)
set.seed(11344)
covariates_df <- data.frame(
Sample = paste0("A", seq(1,960)),
Age = sample(seq(18, 70), size = 960, replace = TRUE),
Group = sample(seq(1, 3), size = 960, replace = TRUE),
Sex = sample(seq(1, 2), size = 960, replace = TRUE),
WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
Place = sample(seq(1, 2), size = 960, replace = TRUE),
Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample
categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)
>str(covariates_df)
'data.frame': 960 obs. of 8 variables:
$ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
$ Age : int 40 66 37 40 25 42 46 51 45 61 ...
$ Group : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
$ WBC : num 12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
$ Place : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
$ Tubes : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
$ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...
The 3 variables that I want to adjust for will be: Place, Tubes and LibPrep. On the other hand, the biological information that I want to preserve would be Age, Group, Sex and WBC.
RNA raw counts
exp_df <- data.frame(replicate(960,sample(0:5216979,66000,rep=TRUE)))
colnames(exp_df) <- covariates_df$Sample
rownames(exp_df) <- paste0("Gene", rownames(exp_df))
exp_df_mat <- as.matrix(exp_df)
> str(exp_df)
'data.frame': 66000 obs. of 960 variables:
$ A1 : int 3687756 4638393 4315502 316404 2209492 831342 4261323 1283200 1522808 1088673 ...
$ A2 : int 4645779 3495763 4782397 2869628 2402288 1012125 1267979 1361720 1919367 4859438 ...
$ A3 : int 2883976 4770076 228011 915940 19038 4650368 112632 1316246 1926498 484384 ...
$ A4 : int 3496840 3676764 2763012 2723528 944224 3809955 5054054 1122139 116722 4090191 ...
$ A5 : int 3140122 650422 2075888 2987814 1656462 2863317 155120 1086391 3163073 625809 ...
$ A6 : int 4084796 1932277 3491059 4898410 4183070 4470479 4193882 928271 3282841 4418068 ...
$ A7 : int 765797 3153554 5075853 4775395 3582194 4274642 1530455 1433179 4757168 2209519 ...
$ A8 : int 1652346 3505656 111027 3170748 5087383 912180 2693545 1907366 3135340 548296 ...
$ A9 : int 441053 5132021 1857530 2413007 1218852 3614836 4388581 106105 3270886 3840996 ...
$ A10 : int 995597 2076013 1576689 1857073 1435772 3788330 3983860 816969 733090 1357226 ...
$ A11 : int 4944929 5182067 4415296 3224862 145068 3533346 4174175 2406340 4572529 4820674 ...
$ A12 : int 2731408 1896439 5063233 4621405 2686720 1507796 4764591 887296 2257893 384470 ...
The biological matrix would be:
biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + WBC, data=covariates_df)
head(biol_mat)
(Intercept) Age as.factor(Sex)2 as.factor(Group)2 as.factor(Group)3 WBC
A1 1 40 0 1 0 12.8
A2 1 66 1 1 0 29.9
A3 1 37 1 1 0 25.7
A4 1 40 0 0 1 0.8
A5 1 25 0 1 0 30.8
A6 1 42 1 1 0 5.9
Could you kindly help me, please? I have tried all the possibilities and I do not really know if it is something wrong my code or it is because combat_seq
cannot handle too much amount of data.
Thanks very much in advance Regards