bedtools2 icon indicating copy to clipboard operation
bedtools2 copied to clipboard

scientific notation for large numbers of reads

Open map2085 opened this issue 8 years ago • 2 comments

command: bedtools genomecov -d -g genome.size -ibam in.bam > out.cov

genome.size consists of a single RNA "myRNA", length = 8 kb nucleotides.

in.bam contains > 15 million reads aligned to "myRNA".

For some positions on the RNA, there is coverage > 1 million. But in the coverage file out.cov, BEDTOOLS uses scientific notation for the coverage at position with > 1million depth.

ex:

... myRNA 1238 1.06464e+06 myRNA 1239 1.06471e+06 myRNA 1240 1.06482e+06 myRNA 1241 1.06494e+06 ...

Can BEDTOOLS please be corrected to output the full, exact integer, NOT in exponent notation?

Thank you

map2085 avatar Feb 24 '16 19:02 map2085

Yep, I will look into it.

arq5x avatar Feb 29 '16 17:02 arq5x

Any update on this issue?

I also think writing large numbers as exact integers is needed, at least as an option.

hd00ljy avatar Jun 16 '22 00:06 hd00ljy

@map2085 @hd00ljy Hi, I'm having the same problem with you. Did you find any way around?

camelest avatar Oct 21 '22 01:10 camelest

@camelest I tried bamCoverage in deepTools. It generates binned bigwigs, which were sufficient for my purpose. But if you need single bp resolution, the deepTools would not be the option

hd00ljy avatar Oct 21 '22 05:10 hd00ljy

@hd00ljy Thank you so much for the information. I will give it a try!

camelest avatar Oct 21 '22 08:10 camelest

@arq5x . what precision do we want for default here? 4, so like 123.4343 ?

brentp avatar Nov 28 '22 21:11 brentp

@brentp @arq5x I think we want something like 1,064,643 (instead of 1.06464e+06). (I also had similar problems mentioned here

By the way in a different situation when I perform bedtools merge -c 5 -o sum -I input.bam even if the sum exceeds > 1million, I don't find any scientific notation. I'm wondering where the difference comes.

camelest avatar Nov 28 '22 23:11 camelest

oh. right. I was thinking this was a float. I will make a PR. Thanks for following up.

brentp avatar Nov 29 '22 00:11 brentp

Thank you so much for your response!

camelest avatar Nov 29 '22 00:11 camelest

bedtools.static.gz

Hi, care to try out this binary to see if it resolves your issue? Just gunzip, chmod +x and then use as ./bedtools.static genomecov ...

brentp avatar Nov 29 '22 22:11 brentp

@brentp Thank you so much for your prompt response. I checked it but somehow still found "e"... Is it because I'm applying -5 options?

./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph grep "e" output.bedGraph chr11 65499044 65499045 1.0946e+06 chr19 48965308 48965309 1.24953e+07

camelest avatar Nov 30 '22 16:11 camelest

Thanks for following up. Indeed, I see why the fix doesn't work. (The ternary must have a fixed type and it still chooses float). I'll come up with a test-case and see how to resolve this.

brentp avatar Nov 30 '22 17:11 brentp

so we can adjust this by messing with the fmtflags. Here is the test code using the number above from @camelest output is shown after // comment.

#include <cstdint>
#include <iostream>

int main() {
    double f = 12495300.0;
    std::cout << f << std::endl; // 1.24953e+07

    std::cout.setf(std::ios_base::fixed);
    std::cout << f << std::endl; // 12495300.000000

    std::cout.precision(0);
    std::cout << f << std::endl; // 123000000
}

So, we can safely set to std::cout.setf(std::ios_base::fixed); which avoids the scientific notation. And when scale == 1.0, I think we can't universally set precison(0), but I can add that (and restore default) in functions that should write integers.

@arq5x you ok with this approach?

I would use std::cout.setf(std::ios_base::fixed); in genomeCoverageMain.cpp and the apply precision(0) and restore default in the appropriate functions.

Or, I could apply fixed and restore default in each function as well.

brentp avatar Nov 30 '22 21:11 brentp

There are sections like this in ReportGenomeCoverage:

   // loop through the depths for the entire genome
    // and report the number and fraction of bases in
    // the entire genome that are at said depth.
    for (histMap::iterator genomeDepthIt = genomeHist.begin(); genomeDepthIt != genomeHist.end(); ++genomeDepthIt) {
        int depth = genomeDepthIt->first;
        CHRPOS numBasesAtDepth = genomeDepthIt->second;

        cout << "genome" << "\t" << depth << "\t" << numBasesAtDepth << "\t"
            << genomeSize << "\t" << (float) ((float)numBasesAtDepth / (float)genomeSize) << endl;
    }

where we want the floating (or scientfic format) retained. For what I proposed above, we'd always get the float (with apparently 6 decimals of precision) but depending on how we do it, we could get a scientific format in some cases.

brentp avatar Nov 30 '22 21:11 brentp

I wonder if moving to printf or separate court calls would be better so that we can have fine-scale control over the precision for integers (such as depth) and floats (such as fraction of bases at depth)

arq5x avatar Nov 30 '22 22:11 arq5x

Integers are printed as expected. The problem is the scale parameter which turns ints into floats even with it is 1.0.

This program:

#include <cstdint>
#include <iostream>

int main() {
        double f = 12495300.0;
        int64_t i = 880000000000;
        std::cout << f << " i:" << i << std::endl; // 1.24953e+07

        std::cout.setf(std::ios_base::fixed);
        std::cout << f << " i:" << i << std::endl; // 12495300.000000

    std::cout.precision(0);
        std::cout << f << " i:" << i << std::endl; // 123000000
}

produces this output:

1.24953e+07 i:880000000000
12495300.000000 i:880000000000
12495300 i:880000000000

brentp avatar Nov 30 '22 22:11 brentp

@brentp @arq5x Thank you so much for following up. It would be really nice if we can choose to output integer as in "precision(0)".

camelest avatar Dec 03 '22 01:12 camelest

I will make a PR that sets precision(0) where appropriate and when scale == 0. It will reset precision to the default after each function call.

brentp avatar Dec 05 '22 16:12 brentp

@camelest . I added a test this time for the latest fix, but if you could verify that this works on your data, it would be much appreciated.

bedtools.static.gz

brentp avatar Dec 05 '22 18:12 brentp

@brentp I tested it and it worked perfect! Thank you so much for your quick solution.

./bedtools.static genomecov -ibam input.bam -5 -bg -strand + > output.bedGraph grep "65499044" output.bedGraph chr11 65499044 65499045 1094599

camelest avatar Dec 06 '22 11:12 camelest