taxpasta
[Feature] Allow setting the threshold for the compositionality check of Bracken results
Checklist
- [X] There are no similar issues or pull requests for this yet.
Problem
I would like to merge multiple Bracken output tables. Prior to merging them, taxpasta checks whether the relative abundances that Bracken reports add up to 1. The default threshold for rounding errors is 0.001. However, for 25 of my 32 profiled samples the sum of all relative abundances is less than 0.999, and taxpasta throws a critical error. Since all of the samples have a total relative abundance of 0.995, I would personally consider them perfectly fine for merging and would like taxpasta to continue anyway.
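For readers unfamiliar with the check: conceptually, it verifies that the fraction_total_reads column sums to 1 within a small tolerance. A minimal Python sketch of that idea (illustrative only, not taxpasta's actual code; the 0.001 default is the threshold mentioned above):

```python
import pandas as pd

def passes_compositionality(profile: pd.DataFrame, tolerance: float = 0.001) -> bool:
    """Return True if the reported fractions sum to 1 within the given tolerance."""
    total = profile["fraction_total_reads"].sum()
    return abs(total - 1.0) <= tolerance

# A sample whose fractions sum to 0.995 fails with the default tolerance of 0.001,
# but would pass with a more lenient value such as 0.01.
```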
Solution
I am currently not aware of an option to set this threshold myself in order to avoid taxpasta failing on slightly sub-optimal data. Therefore, I would suggest adding a command-line option that lets the user set this parameter.
Alternatives
No response
Anything else?
No response
Makes perfect sense, thanks for the report 🙂 I would even consider this a bug, since the threshold is chosen in such a way that it fails on real-world examples.
Would you be able to share any of those profiles so that we can include them in the public test data?
Okay, I'm splitting this into a bug and an enhancement issue. This will remain the feature request and the bug is tracked in #75.
Thinking a bit more about it, it makes sense to me to have some kind of internal configuration system to easily change these hardcoded values. However, I'm not sure we actually want users to be able to modify these (easily). Taxpasta is simply supposed to work with all profiles generated by profilers. No configuration necessary.
I'll leave this request open for now but may deny it after talking to the others.
If the function of taxpasta is to merge and convert formats, I think it should only throw a warning, not a critical error, when the sum of the composition is less than 1.0.
I have been unable to join my bracken output files with bracken thresholds of less than -t 10. In this way, taxpasta is actually constraining my analysis.
Example critical error message for bracken outputs computed with -t 1:
[CRITICAL] Error in sample 'FID953321' with profile 'results/03_brackenGenus/FID953321.bracken'.
[CRITICAL]   schema_context                column             check  check_number  failure_case  index
           0         Column  fraction_total_reads  compositionality             2         False   None
I can only alleviate this by increasing the bracken -t parameter to 10.
I agree that taxpasta should not constrain analyses. At the same time, we need to ensure that the data we read in is in the expected format, and we can only do this via validation. I will look into making the compositionality checks even more lenient, though.
@BenjaminJPerry would you be able to publicly share with me an example Bracken profile that fails?
I've found a bracken file (a low coverage one, out of about 2k samples) with a sum of only 97.6%:
$ cut -f 7 ../../output/taxonid/bracken/atsphere~CFA0205m~k150_bracken.txt | tail -n +2 | sum
.97628
I can share it with you if needed, but it's a lot of 0.0003s, which are probably 0.00033333333333333..., which I guess adds up to a shortfall of ~0.03 over about 10^3 taxa (the file has ~2k taxa), plus all the other rounding errors.
Yes, probably not a very useful sample, but it would ideally not lead to a crash. Can we bump the limit to, say, 5%, or at least make it configurable from the CLI?
If you want to get really fancy, it should be pretty easy to calculate the expected deviation from 1 when truncating at 5 digits based on the read count column.
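As a rough sketch of that suggestion (not taxpasta code; it assumes the standard Bracken column names and that fractions are truncated rather than rounded):

```python
import pandas as pd

def expected_truncation_shortfall(profile: pd.DataFrame, digits: int = 5) -> float:
    """Estimate how far truncated fractions should fall short of summing to 1.

    Recomputes exact fractions from the 'new_est_reads' column and compares
    them with the same values truncated to `digits` decimal places.
    """
    reads = profile["new_est_reads"]
    exact = reads / reads.sum()
    step = 10.0 ** -digits
    truncated = (exact / step).astype(int) * step  # approximate truncation
    return float(exact.sum() - truncated.sum())

# Each taxon can lose up to 10**-digits to truncation, so with ~2,000 taxa a
# shortfall on the order of a few percent is entirely plausible.
```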
Yes, if you can share it, that'd be fantastic. And by share, I mean openly such that we can make it part of the test suite.
Here you are @Midnighter atsphere~CFA0205m~k150_bracken.txt
And yes, use it however you want. One could pretty easily construct a pathological case though: one read per taxon for 1/0.000044444444 taxa would give the largest deviation from one, no? (Or any other recurring number.)
Interestingly, when I take the new_est_reads column in that profile and divide it by its sum, the fractions deviate quite a bit from the fraction_total_reads column (up to 8% differences between values), and in a way that I cannot easily explain by rounding. I wonder whether Bracken internally uses a different total number of reads, taken from the Kraken2 profile, and that is why the total fraction deviates so much from unity.
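For anyone who wants to reproduce that comparison, a small pandas sketch (assuming the standard Bracken column names; the file name is the one shared above):

```python
import pandas as pd

# Compare fractions recomputed from the estimated read counts with the
# fractions that Bracken itself reports.
profile = pd.read_csv("atsphere~CFA0205m~k150_bracken.txt", sep="\t")
recomputed = profile["new_est_reads"] / profile["new_est_reads"].sum()
deviation = (recomputed - profile["fraction_total_reads"]).abs()
print(deviation.max())  # deviations of several percent suggest more than rounding
```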
@kdm9 Can you share the associated Kraken2 profile with me, too, please?
Hi @Midnighter, do you need additional profile data for fixing this? I have encountered the same problem with my dataset.
How I call taxpasta merge:
taxpasta merge -p bracken -o bracken_S1_virus.tsv 2613_pe_virus.bracken.txt 2612_pe_virus.bracken.txt
Error message:
[02:24:49] CRITICAL Error in sample '2613_pe_virus.bracken' with profile '2613_pe_virus.bracken.txt'.  merge.py:419
           CRITICAL   schema_context                column             check  check_number  failure_case  index    merge.py:424
                    0         Column  fraction_total_reads  compositionality             2         False   None
Here are the profiles that I use:
2612_pe_virus.bracken.txt 2613_pe_virus.bracken.txt
Here are the kraken2 profiles. I ran bracken on these profiles with the arguments bracken -r 100 -l S1 -t 0, so they are at the virus strain level.
2612_pe_standardized.kraken2.report.txt 2613_pe_standardized.kraken2.report.txt
Hope these can be of help.
BTW, can we have a parameter to control the compositionality check? Maybe a double ranged in [0, 1], with 0 turning the check off?
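Purely to illustrate one reading of that proposal (hypothetical; taxpasta exposes no such parameter today, and the function below is not part of its code base):

```python
def check_composition(total: float, min_fraction_sum: float = 0.999) -> bool:
    """Hypothetical user-configurable compositionality check.

    min_fraction_sum is a float in [0, 1]: the reported fractions must sum to
    at least this value. Following the suggestion above, 0 disables the check,
    since any sum satisfies the bound.
    """
    return total >= min_fraction_sum
```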
In #147, I've turned the compositionality errors into warnings. I hope that addresses most concerns voiced here.
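For context on how a validation library can support this kind of relaxation: pandera checks accept a raise_warning flag that turns a failing check into a warning instead of an error. A simplified sketch of that mechanism (not necessarily the exact change made in #147):

```python
import pandera as pa

# Simplified stand-in for a Bracken profile schema: the compositionality
# check warns instead of raising when the fractions do not sum to ~1.
schema = pa.DataFrameSchema(
    {
        "fraction_total_reads": pa.Column(
            float,
            checks=pa.Check(
                lambda series: abs(series.sum() - 1.0) <= 0.001,
                name="compositionality",
                raise_warning=True,  # emit a warning rather than a SchemaError
            ),
        )
    }
)
```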