Incorrect read length distribution produced when read length close to reference length

Open amandawarr opened this issue 4 years ago • 0 comments

Describe the bug

We are trying to simulate "very good" sequencing data for an amplicon that is almost the full length of the ~15 kb genome we are sequencing, but the read length distributions coming out of badread are not as expected. With a mean length of 14 kb and a low standard deviation (10), we get a histogram like this:

And with a high standard deviation (13000), we get a distribution like this:

The histograms printed to stdout during the run look as expected, but the reads in the final output do not reflect them.
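One way to compare the final output against the histograms printed during the run is to tabulate read lengths directly from the output file. The following is a minimal sketch, not part of badread: the function names are hypothetical, and it assumes the output is FASTQ with unwrapped four-line records.

```python
import io
import statistics
from collections import Counter


def fastq_read_lengths(handle):
    """Yield the sequence length of each record in a FASTQ stream.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def length_histogram(lengths, bin_size=1000):
    """Count reads per length bin; each key is the lower edge of its bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


def summarise(lengths):
    """Return the read count, mean length and population standard deviation."""
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),
    }


# Tiny in-memory example; for real data, pass an open file handle instead.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n")
lengths = list(fastq_read_lengths(example))
print(summarise(lengths))
print(length_histogram(lengths, bin_size=5))
```

For a real run, `fastq_read_lengths(open(path))` can be pointed at the saved reads, where `path` is wherever the stdout of the simulation was redirected.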

To reproduce

The genome used is here: https://www.ncbi.nlm.nih.gov/nuccore/1695217306

The command used is:

badread simulate --reference Lelystad.fasta --quantity 800x --length 14000,10 --error_model random --qscore_model ideal --glitches 0,0,0 --junk_reads 0 --random_reads 0 --chimeras 0 --identity 95,100,4 --start_adapter_seq "" --end_adapter_seq ""
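To make the expected and observed distributions easier to reason about, here is a toy model of the `--length 14000,10` setting. It is a sketch under stated assumptions, not badread's code: it assumes fragment lengths are drawn from a gamma distribution parameterised by mean and standard deviation (as the Badread documentation describes), and it assumes, purely as a hypothesis for illustration, that a fragment starting at a uniformly random position on a linear reference is clipped at the reference end. The reference length of 15,000 is a round stand-in for the ~15 kb genome, and all function names are hypothetical.

```python
import random
import statistics


def gamma_params(mean, stdev):
    """Convert a mean and standard deviation into gamma shape and scale."""
    shape = (mean / stdev) ** 2
    scale = stdev ** 2 / mean
    return shape, scale


def sample_fragment_lengths(mean, stdev, n, rng):
    """Draw n requested fragment lengths from the gamma distribution."""
    shape, scale = gamma_params(mean, stdev)
    return [max(1, round(rng.gammavariate(shape, scale))) for _ in range(n)]


def clip_to_linear_reference(fragment_lengths, ref_len, rng):
    """ASSUMPTION (not confirmed badread behaviour): pick a uniform start
    position and cut the fragment off at the end of a linear reference."""
    clipped = []
    for frag in fragment_lengths:
        start = rng.randrange(ref_len)
        clipped.append(min(frag, ref_len - start))
    return clipped


rng = random.Random(0)
REF_LEN = 15_000  # assumed round figure for the ~15 kb genome
requested = sample_fragment_lengths(14_000, 10, 2_000, rng)
observed = clip_to_linear_reference(requested, REF_LEN, rng)

print("requested mean:", statistics.mean(requested))
print("observed mean under the clipping assumption:", statistics.mean(observed))
```

If clipping of this kind does occur, the requested lengths would match the stdout histograms while the emitted reads would be spread over a much wider range, which is the pattern described above. Whether that is what actually happens inside badread is not established here.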

Jun 29 '21 11:06 amandawarr