
Scientific notation in abundance file results in rounding errors

Open donovan-h-parks opened this issue 9 months ago • 5 comments

Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when numbers get large, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Archaea 	439684927 	461 	0.0130496%
domain 	Bacteria 	609216830 	1.05068e+06 	29.7417%

This can result in small errors due to rounding. For example, in this case there are really 1050675 Bacterial read pairs, but the count gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates, it makes it challenging to track the fate of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data but, more generally, not being able to account for all reads is a bit scary.
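
A minimal sketch of how this rounding can arise, assuming the count is held as a floating-point value and written through a C++ stream with default formatting (an assumption, not something confirmed from the MetaCache source). The helper names `format_as_double` and `parse_count` are hypothetical and only serve the illustration:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: format a count the way a default-configured
// std::ostream formats a double. The default precision is 6 significant
// digits, and large values switch to scientific notation.
std::string format_as_double(double count) {
    std::ostringstream os;
    os << count;
    return os.str();
}

// Hypothetical helper: read the printed value back as an integer,
// as a downstream parser of the abundance file would.
long long parse_count(const std::string& text) {
    return static_cast<long long>(std::stod(text));
}
```

With these helpers, `format_as_double(1050675.0)` yields the string `1.05068e+06`, and `parse_count` on that string returns 1050680, which matches the 5-read discrepancy described above. Small counts such as 461 have fewer than 6 significant digits and survive the round trip unchanged.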

I imagine the intent is for this profile to produce integers, so I am hoping this can be fixed in the next release. Thanks.
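
For illustration, here are two ways a C++ writer could emit exact counts. This is a sketch under the assumption that the value passes through a standard output stream; the function names are hypothetical and are not taken from MetaCache:

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Option 1 (hypothetical): keep the count in an integer type.
// Integers are always streamed as exact decimal digits.
std::string format_as_integer(std::uint64_t count) {
    std::ostringstream os;
    os << count;
    return os.str();
}

// Option 2 (hypothetical): if the count must stay a double, force
// fixed notation with zero decimals instead of the 6-digit default.
std::string format_fixed(double count) {
    std::ostringstream os;
    os << std::fixed << std::setprecision(0) << count;
    return os.str();
}
```

Both `format_as_integer(1050675)` and `format_fixed(1050675.0)` produce `1050675`. The integer variant is the more robust choice, since a double can only represent whole numbers exactly up to 2^53.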

donovan-h-parks Sep 11 '23 21:09