microeco
microeco copied to clipboard
LEfSe's results look different than before
Hi, I used the microeco package for LEfSe analysis, but the result is different from the package of LEfSe result, and the LDA value is much larger, is it because I did not pay attention to other details?
Here is the R code I analyzed with the microeco package:
library("microeco")
library("file2meco")
library("tidyverse")
library("magrittr")
ds_comp <- mpa2meco("abundance_table.txt",
sample_table = group.txt, auto_tidy = TRUE)
set.seed(123)
lefse_res_all <- trans_diff$new(dataset = ds_comp,
method = "lefse",
p_adjust_method = "BH",
group = "group",
alpha = 1, remove_unknown = F)
lefse_res_all$res_diff %<>% filter (LDA > 2)
write.table(lefse_res_all$res_diff, "microeco_lefse.txt", row.names = F, quote = F, sep = '\t')
Here is the shell code I analyzed with the LEfSe package:
python lefse/format_input.py abundance_table_add.group.txt lefse.in -u 1 -c 2 -o 1000000
python lefse/run_lefse.py lefse.in lefse.out.txt
For example, the LDA value of c__Clostridia in the microeco package is 6.97, but the result in the LEfSe package is 4.9. I don't know why, butI think I need some help ,please.
Thank you!
Hi,
The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the cal_abund function to calculate the taxa_abund list, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000.
However, in this case, the input data is already in percentage form, and the abundances in taxa_abund generated by reading with mpa2meco are also in percentages, which leads to such result.
There are two solutions:
- When using mpa2meco to read in, use the
relparameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE)
lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp,
method = "lefse", p_adjust_method = "none", alpha = 0.05,
group = "group", remove_unknown = F)
- The second way is to change
lefse_normparameter from 1000000 to 10000. It is the recommended solution.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE)
lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000,
method = "lefse", p_adjust_method = "none", alpha = 0.05,
group = "group", remove_unknown = F)
In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set p_adjust_method = "none", alpha = 0.05 here.
Thanks for your issue! I will try to optimize the lefse_norm parameter to make it in line with the python lefse.
Hi, The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the
cal_abundfunction to calculate thetaxa_abundlist, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000. However, in this case, the input data is already in percentage form, and the abundances intaxa_abundgenerated by reading with mpa2meco are also in percentages, which leads to such result.There are two solutions:
- When using mpa2meco to read in, use the
relparameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE) lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp, method = "lefse", p_adjust_method = "none", alpha = 0.05, group = "group", remove_unknown = F)
- The second way is to change
lefse_normparameter from 1000000 to 10000. It is the recommended solution.ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE) lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000, method = "lefse", p_adjust_method = "none", alpha = 0.05, group = "group", remove_unknown = F)In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set
p_adjust_method = "none", alpha = 0.05here.
Hi, The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the
cal_abundfunction to calculate thetaxa_abundlist, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000. However, in this case, the input data is already in percentage form, and the abundances intaxa_abundgenerated by reading with mpa2meco are also in percentages, which leads to such result.There are two solutions:
- When using mpa2meco to read in, use the
relparameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE) lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp, method = "lefse", p_adjust_method = "none", alpha = 0.05, group = "group", remove_unknown = F)
- The second way is to change
lefse_normparameter from 1000000 to 10000. It is the recommended solution.ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE) lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000, method = "lefse", p_adjust_method = "none", alpha = 0.05, group = "group", remove_unknown = F)In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set
p_adjust_method = "none", alpha = 0.05here.
Thanks for your prompt feedback. I have another question, what is the method of calculating the direction of enrichment between groups? Median or mean? Thanks!
Median. Same with the lefse python version.
Median. Same with the lefse python version.
I ran the code the way you recommended. I found at the species level that the direction of enrichment is very different from the python version, and it seems that the python version of the calculation is more consistent with using the mean.
Could you please attach an example that I can check it out?
Could you please attach an example that I can check it out?
Yes, of course, as shown in the attachment: example.zip
#R code
ds_comp <- mpa2meco("example/abundance_table.txt", sample_table = "example/group.txt", auto_tidy = TRUE)
lefse_res_all <- trans_diff$new(dataset = ds_comp,
lefse_norm = 10000,
method = "lefse",
p_adjust_method = "none",
alpha = 0.05,
group = "group",
remove_unknown = F)
lefse_res_all$res_diff %<>% filter (LDA > 2)
write.table(lefse_res_all$res_diff, "example/microeco_lefse.txt", row.names = F, quote = F, sep = '\t')
#python version
python example/lefse/format_input.py example/abundance_table_add.group.txt example/lefse.in -u 1 -c 2 -o 1000000
python example/lefse/run_lefse.py example/lefse.in example/lefse.out.txt
For example,"s__Dialister_sp_CAG_357、s__Eubacterium_siraeum、s__Hungatella_hathewayi" are enriched in the Healthy group in the python version, but enriched in the Case group in the R version. Thanks!
Hi. I find that the median of those species is 0 for both case and healthy group, which lead to a chaotic comparison (maybe depend on the group order) in microeco lefse. I will fix it. Thanks.
For s__Dialister_sp_CAG_357 and s__Eubacterium_siraeum, the median in two groups are 0. But for s__Hungatella_hathewayi, case has a higher median than healthy (0), while healthy has a higher mean than case as you show. I think since we used non-parametric test, we should preferentially use median in all the result. How do you think so?