microeco icon indicating copy to clipboard operation
microeco copied to clipboard

LEfSe's results look different than before

Open wfgui opened this issue 1 year ago • 9 comments

Hi, I used the microeco package for LEfSe analysis, but the result is different from the package of LEfSe result, and the LDA value is much larger, is it because I did not pay attention to other details?

Here is the R code I analyzed with the microeco package:

library("microeco")
library("file2meco")
library("tidyverse")
library("magrittr")

ds_comp <- mpa2meco("abundance_table.txt", 
                    sample_table = group.txt, auto_tidy = TRUE)
set.seed(123)
lefse_res_all <- trans_diff$new(dataset = ds_comp, 
                                method = "lefse", 
                                p_adjust_method = "BH", 
                                group = "group",
                                alpha = 1, remove_unknown = F)
lefse_res_all$res_diff %<>% filter (LDA > 2)
write.table(lefse_res_all$res_diff, "microeco_lefse.txt", row.names = F, quote = F, sep = '\t')

Here is the shell code I analyzed with the LEfSe package:

python lefse/format_input.py  abundance_table_add.group.txt  lefse.in -u 1 -c 2 -o 1000000
python lefse/run_lefse.py  lefse.in  lefse.out.txt

For example, the LDA value of c__Clostridia in the microeco package is 6.97, but the result in the LEfSe package is 4.9. I don't know why, butI think I need some help ,please.

Thank you!

example.zip

wfgui avatar Jul 19 '24 06:07 wfgui

Hi, The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the cal_abund function to calculate the taxa_abund list, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000. However, in this case, the input data is already in percentage form, and the abundances in taxa_abund generated by reading with mpa2meco are also in percentages, which leads to such result.

There are two solutions:

  1. When using mpa2meco to read in, use the rel parameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE)
lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp, 
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)
  1. The second way is to change lefse_norm parameter from 1000000 to 10000. It is the recommended solution.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE)
lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000,
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)

In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set p_adjust_method = "none", alpha = 0.05 here.

ChiLiubio avatar Jul 19 '24 14:07 ChiLiubio

Thanks for your issue! I will try to optimize the lefse_norm parameter to make it in line with the python lefse.

ChiLiubio avatar Jul 19 '24 14:07 ChiLiubio

Hi, The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the cal_abund function to calculate the taxa_abund list, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000. However, in this case, the input data is already in percentage form, and the abundances in taxa_abund generated by reading with mpa2meco are also in percentages, which leads to such result.

There are two solutions:

  1. When using mpa2meco to read in, use the rel parameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE)
lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp, 
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)
  1. The second way is to change lefse_norm parameter from 1000000 to 10000. It is the recommended solution.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE)
lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000,
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)

In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set p_adjust_method = "none", alpha = 0.05 here.

Hi, The probable cause of the issue is in the data normalization. The "-o 1000000" in "python format_input.py" command means that the maximum value after normalization will not exceed 1,000,000. This can be seen from the data in the "lefse.in" file. I have verified that again. In the LEfSe method within the microeco package, the approach was to directly multiply the abundances in the taxa_abund of the microtable object by 1,000,000. Therefore, when we use the cal_abund function to calculate the taxa_abund list, the results are typically between 0 and 1, and multiplying by 1,000,000 still does not exceed 1,000,000. However, in this case, the input data is already in percentage form, and the abundances in taxa_abund generated by reading with mpa2meco are also in percentages, which leads to such result.

There are two solutions:

  1. When using mpa2meco to read in, use the rel parameter to convert the abundances in taxa_abund to relative abundances between 0 and 1.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", rel = TRUE, auto_tidy = TRUE)
lefse_res_all_meco2 <- trans_diff$new(dataset = ds_comp, 
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)
  1. The second way is to change lefse_norm parameter from 1000000 to 10000. It is the recommended solution.
ds_comp <- mpa2meco("abundance_table.txt", sample_table = "group.txt", auto_tidy = TRUE)
lefse_res_all_meco3 <- trans_diff$new(dataset = ds_comp, lefse_norm = 10000,
						method = "lefse", p_adjust_method = "none", alpha = 0.05, 
						group = "group", remove_unknown = F)

In addition, the default alpha in python lefse is 0.05 and there is no p value adjustment in it, so I set p_adjust_method = "none", alpha = 0.05 here.

Thanks for your prompt feedback. I have another question, what is the method of calculating the direction of enrichment between groups? Median or mean? Thanks!

wfgui avatar Jul 22 '24 01:07 wfgui

Median. Same with the lefse python version.

ChiLiubio avatar Jul 22 '24 02:07 ChiLiubio

Median. Same with the lefse python version.

I ran the code the way you recommended. I found at the species level that the direction of enrichment is very different from the python version, and it seems that the python version of the calculation is more consistent with using the mean.

wfgui avatar Jul 22 '24 05:07 wfgui

Could you please attach an example that I can check it out?

ChiLiubio avatar Jul 23 '24 02:07 ChiLiubio

Could you please attach an example that I can check it out?

Yes, of course, as shown in the attachment: example.zip

#R code
ds_comp <- mpa2meco("example/abundance_table.txt", sample_table = "example/group.txt", auto_tidy = TRUE)
lefse_res_all <- trans_diff$new(dataset = ds_comp, 
                                                lefse_norm = 10000,
						method = "lefse", 
                                                p_adjust_method = "none", 
                                                alpha = 0.05, 
						group = "group", 
                                                remove_unknown = F)
lefse_res_all$res_diff %<>% filter (LDA > 2)
write.table(lefse_res_all$res_diff, "example/microeco_lefse.txt", row.names = F, quote = F, sep = '\t') 

#python version
python example/lefse/format_input.py example/abundance_table_add.group.txt example/lefse.in -u 1 -c 2 -o 1000000
python example/lefse/run_lefse.py example/lefse.in example/lefse.out.txt

For example,"s__Dialister_sp_CAG_357、s__Eubacterium_siraeum、s__Hungatella_hathewayi" are enriched in the Healthy group in the python version, but enriched in the Case group in the R version. Thanks!

wfgui avatar Jul 23 '24 06:07 wfgui

Hi. I find that the median of those species is 0 for both case and healthy group, which lead to a chaotic comparison (maybe depend on the group order) in microeco lefse. I will fix it. Thanks.

ChiLiubio avatar Jul 23 '24 12:07 ChiLiubio

For s__Dialister_sp_CAG_357 and s__Eubacterium_siraeum, the median in two groups are 0. But for s__Hungatella_hathewayi, case has a higher median than healthy (0), while healthy has a higher mean than case as you show. I think since we used non-parametric test, we should preferentially use median in all the result. How do you think so?

ChiLiubio avatar Jul 23 '24 13:07 ChiLiubio