taxpasta icon indicating copy to clipboard operation
taxpasta copied to clipboard

[Feature] Allow to set threshold for compositionality check of Bracken results

Open alexhbnr opened this issue 1 year ago • 12 comments

Checklist

Problem

I would like to merge multiple Bracken output tables. Prior to merging them, taxpasta performs a check whether the relative abundances that Bracken reports add up to 1. The default threshold for rounding errors is 0.001. However, from my 32 profiled samples, the sum of all relative abundances is less than 0.999 for 25 of them and taxpasta throws a critical error. Since all of the samples have a total relative abundance of 0.995, I would personally consider the samples perfectly fine for merging and would like taxpasta to continue anyway.

Solution

I am currently not aware that there is an option to set this threshold oneself to avoid taxpasta failing for slightly sub-optimal data. Therefore, I would suggest to add a commandline option to allow to set the parameter for the user.

Alternatives

No response

Anything else?

No response

alexhbnr avatar Mar 12 '23 08:03 alexhbnr

Makes perfect sense, thanks for the report 🙂 I would even consider this a bug since the threshold is chosen in a way that it fails on real world examples.

Would you be able to share any of those profiles so that we can include it in the public test data?

Midnighter avatar Mar 12 '23 12:03 Midnighter

Okay, I'm splitting this into a bug and an enhancement issue. This will remain the feature request and the bug is tracked in #75.

Midnighter avatar Mar 12 '23 14:03 Midnighter

Thinking a bit more about it, it makes sense to me to have some kind of internal configuration system to easily change these hardcoded values. However, I'm not sure we actually want users to be able to modify these (easily). Taxpasta is simply supposed to work with all profiles generated by profilers. No configuration necessary.

I'll leave this request open for now but may deny it after talking to the others.

Midnighter avatar Mar 12 '23 14:03 Midnighter

If the function of taxpasta is to merge and convert formats, I think it should only throw a warning in the case that the sum of the composition is less than 1.0, not a critical error.

I have been unable to join my bracken output files with bracken thresholds of less than -t 10. In this way, taxpasta is actually constraining my analysis.

Example critical error message for bracken outputs computed with -t 1:

[CRITICAL] Error in sample 'FID953321' with profile 'results/03_brackenGenus/FID953321.bracken'. [CRITICAL] schema_context column check check_number failure_case index 0 Column fraction_total_reads compositionality 2 False None

I can only alleviate this by increasing the bracken -t parameter to 10.

BenjaminJPerry avatar Mar 24 '23 02:03 BenjaminJPerry

I agree that taxpasta should not be constraining to analyses. At the same time, we need to ensure that the data that we read in is in the expected format. We can only do this via validation. I will look into making the compositionality checks even more lenient, though.

Midnighter avatar Mar 24 '23 09:03 Midnighter

@BenjaminJPerry would you be able to share an example Bracken profile that fails with me publicly?

Midnighter avatar Mar 24 '23 09:03 Midnighter

I've found a bracken file (a low coverage one, out of about 2k samples) with a sum of only 97.6%:

$ cut -f 7 ../../output/taxonid/bracken/atsphere~CFA0205m~k150_bracken.txt | tail -n +2 | sum 
.97628

Can share it with you if needed, but it's a lot of 0.0003s, which are probably 0.00033333333333333..., which I guess sums to ~0.03 over about 10^3 taxa (the file has ~2k taxa), plus all the other rounding errors.

Yes, probably not a very useful sample, but would ideally not lead to a crash. Can we bump the limit to say 5%, or at least make it configurable from the CLI?

If you want to get really fancy, it should be pretty easy to calculate the expected deviation from 1 when truncating at 5 digits based on the read count column.

kdm9 avatar Apr 19 '23 18:04 kdm9

Yes, if you can share it, that'd be fantastic. And by share, I mean openly such that we can make it part of the test suite.

Midnighter avatar Apr 19 '23 19:04 Midnighter

Here you are @Midnighter atsphere~CFA0205m~k150_bracken.txt

And yes, use it however you want. One could pretty easily make a pathological case though: one read per taxa for 1/0.000044444444 taxa. would give the largest deviation from one, no? (or any other recurring number)

kdm9 avatar Apr 20 '23 07:04 kdm9

Interestingly, when I take the new_est_reads column in that profile and divide it by its sum, the fractions deviate quite a bit from the column fraction_total_reads (up to 8% differences between values) and in a way that I can not easily explain by rounding. I wonder if Bracken uses a different total number of reads internally taken from the Kraken2 profile and that's why the total fraction deviates so much from unity.

@kdm9 Can you share the associated Kraken2 profile with me, too, please?

Midnighter avatar Apr 29 '23 11:04 Midnighter

Hi @Midnighter , do you need additional profile data on fixing this? I been encountered the same problem with my dataset.

How I call taxpasta merge:

taxpasta merge -p bracken -o bracken_S1_virus.tsv
2613_pe_virus.bracken.txt 2612_pe_virus.bracken.txt

error message:

[02:24:49] CRITICAL Error in sample '2613_pe_virus.bracken' with profile '2613_pe_virus.bracken.txt'.                           merge.py:419
           CRITICAL   schema_context                column             check  check_number  failure_case index                  merge.py:424
                    0         Column  fraction_total_reads  compositionality             2         False  None

Here are the profiles that I use:

2612_pe_virus.bracken.txt 2613_pe_virus.bracken.txt

Here are the kraken2 profiles, I call braken on this profiles with arguments bracken -r 100 -l S1 -t 0, so they are on the virus stain level.

2612_pe_standardized.kraken2.report.txt 2613_pe_standardized.kraken2.report.txt

Hope these can be of help.

MajoroMask avatar Oct 26 '23 02:10 MajoroMask

BTW, can we have a parameter to control the compositionality check? Maybe a double type ranged [0, 1], 0 to turn it off?

MajoroMask avatar Oct 26 '23 02:10 MajoroMask

In #147, I've turned the compositionality errors into warnings. I hope that addresses most concerns voiced here.

Midnighter avatar Jun 08 '24 10:06 Midnighter