bedtools2
bedtools2 copied to clipboard
scientific notation for large numbers of reads
command: bedtools genomecov -d -g genome.size -ibam in.bam > out.cov
genome.size consists of a single RNA "myRNA", length = 8 kb nucleotides.
in.bam contains > 15 million reads aligned to "myRNA".
For some positions on the RNA, there is coverage > 1 million. But in the coverage file out.cov, BEDTOOLS uses scientific notation for the coverage at position with > 1million depth.
ex:
... myRNA 1238 1.06464e+06 myRNA 1239 1.06471e+06 myRNA 1240 1.06482e+06 myRNA 1241 1.06494e+06 ...
Can BEDTOOLS please be corrected to output the full, exact integer, NOT in exponent notation?
Thank you
Yep, I will look into it.
Any update on this issue?
I also think writing large numbers as exact integers is needed, at least as an option.
@map2085 @hd00ljy Hi, I'm having the same problem with you. Did you find any way around?
@camelest I tried bamCoverage in deepTools. It generates binned bigwigs, which were sufficient for my purpose. But if you need single bp resolution, the deepTools would not be the option
@hd00ljy Thank you so much for the information. I will give it a try!
@arq5x . what precision do we want for default here? 4, so like 123.4343
?
@brentp @arq5x I think we want something like 1,064,643 (instead of 1.06464e+06). (I also had similar problems mentioned here
By the way in a different situation when I perform
bedtools merge -c 5 -o sum -I input.bam
even if the sum exceeds > 1million, I don't find any scientific notation. I'm wondering where the difference comes.
oh. right. I was thinking this was a float. I will make a PR. Thanks for following up.
Thank you so much for your response!
Hi, care to try out this binary to see if it resolves your issue? Just gunzip, chmod +x and then use as ./bedtools.static genomecov ...
@brentp Thank you so much for your prompt response. I checked it but somehow still found "e"... Is it because I'm applying -5 options?
./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph
grep "e" output.bedGraph
chr11 65499044 65499045 1.0946e+06
chr19 48965308 48965309 1.24953e+07
Thanks for following up. Indeed, I see why the fix doesn't work. (The ternary must have a fixed type and it still chooses float). I'll come up with a test-case and see how to resolve this.
so we can adjust this by messing with the fmtflags. Here is the test code using the number above from @camelest output is shown after // comment.
#include <cstdint>
#include <iostream>
int main() {
double f = 12495300.0;
std::cout << f << std::endl; // 1.24953e+07
std::cout.setf(std::ios_base::fixed);
std::cout << f << std::endl; // 12495300.000000
std::cout.precision(0);
std::cout << f << std::endl; // 123000000
}
So, we can safely set to std::cout.setf(std::ios_base::fixed);
which avoids the scientific notation.
And when scale == 1.0
, I think we can't universally set precison(0)
, but I can add that (and restore default) in functions that should write integers.
@arq5x you ok with this approach?
I would use std::cout.setf(std::ios_base::fixed);
in genomeCoverageMain.cpp
and the apply precision(0)
and restore default in the appropriate functions.
Or, I could apply fixed and restore default in each function as well.
There are sections like this in ReportGenomeCoverage
:
// loop through the depths for the entire genome
// and report the number and fraction of bases in
// the entire genome that are at said depth.
for (histMap::iterator genomeDepthIt = genomeHist.begin(); genomeDepthIt != genomeHist.end(); ++genomeDepthIt) {
int depth = genomeDepthIt->first;
CHRPOS numBasesAtDepth = genomeDepthIt->second;
cout << "genome" << "\t" << depth << "\t" << numBasesAtDepth << "\t"
<< genomeSize << "\t" << (float) ((float)numBasesAtDepth / (float)genomeSize) << endl;
}
where we want the floating (or scientfic format) retained. For what I proposed above, we'd always get the float (with apparently 6 decimals of precision) but depending on how we do it, we could get a scientific format in some cases.
I wonder if moving to printf
or separate court calls would be better so that we can have fine-scale control over the precision for integers (such as depth) and floats (such as fraction of bases at depth)
Integers are printed as expected. The problem is the scale parameter which turns ints into floats even with it is 1.0.
This program:
#include <cstdint>
#include <iostream>
int main() {
double f = 12495300.0;
int64_t i = 880000000000;
std::cout << f << " i:" << i << std::endl; // 1.24953e+07
std::cout.setf(std::ios_base::fixed);
std::cout << f << " i:" << i << std::endl; // 12495300.000000
std::cout.precision(0);
std::cout << f << " i:" << i << std::endl; // 123000000
}
produces this output:
1.24953e+07 i:880000000000
12495300.000000 i:880000000000
12495300 i:880000000000
@brentp @arq5x Thank you so much for following up. It would be really nice if we can choose to output integer as in "precision(0)".
I will make a PR that sets precision(0) where appropriate and when scale == 0. It will reset precision to the default after each function call.
@camelest . I added a test this time for the latest fix, but if you could verify that this works on your data, it would be much appreciated.
@brentp I tested it and it worked perfect! Thank you so much for your quick solution.
./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph
grep "65499044" output.bedGraph
chr11 65499044 65499045 1094599