
NGMLR very slow on bovine nanopore reads

Open sdjebali opened this issue 4 years ago • 2 comments

Dear all,

First of all, thanks for this very nice development.

I just wanted to report that on some fairly large bovine ONT runs, NGMLR followed by samtools sort was very slow (about 4 days for 4 million reads).

I was wondering whether I was using the tool correctly (with the right parameters)?

I tried with the first 1 million reads like this:

`zcat $fastq | head -n 4000000 | ngmlr --presets ont -t 22 -r $genome | samtools sort -@ 6 -o $output`

and it took 5 h 23 min to complete.

I then tried with the second 1 million reads like this:

`zcat $fastq | tail -n+4000000 | head -n 4000000 | ngmlr --presets ont -t 22 -r $genome | samtools sort -@ 4 -o $output`

and it took 24 h 10 min to complete.
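As an aside: with 4 lines per FASTQ record, read k starts at line 4·(k−1)+1, so the second million reads begin at line 4,000,001, while `tail -n+4000000` starts one line earlier, mid-record. A minimal sketch of record-aligned batch extraction (the `extract_reads` helper and the file name in the comment are hypothetical, not part of NGMLR):

```shell
#!/bin/sh
# extract_reads START COUNT
# Reads a FASTQ stream on stdin and emits COUNT records starting at
# read index START (1-based), assuming exactly 4 lines per record.
extract_reads() {
  start_line=$(( ( $1 - 1 ) * 4 + 1 ))   # first line of read START
  tail -n +"$start_line" | head -n $(( $2 * 4 ))
}

# Example (hypothetical file name): the second million reads would be
#   zcat reads.fastq.gz | extract_reads 1000001 1000000 > batch2.fastq
# which starts at line 4,000,001, keeping records aligned.
```

With the original `tail -n+4000000`, every record in the second batch would start on the previous read's quality line, so the timings above may also include NGMLR coping with malformed records.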

I am using NGMLR version 0.2.8 and samtools version 1.9. Details about my machine:

Linux tatum 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux
24 processors; processor 0: vendor_id GenuineIntel, cpu family 6, model 45, model name Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz

Any advice would be warmly welcome.

Best, Sarah

sdjebali avatar Oct 11 '19 08:10 sdjebali

Thanks Sarah, do you have an average read length? It's likely, but unfortunate, that some of your 2nd batch reads are very long. Thanks, Fritz

fritzsedlazeck avatar Oct 15 '19 13:10 fritzsedlazeck

Indeed there seems to be a big read length difference between the two batches.

I ran NanoPlot on them and here are the results:

  • First 1 million reads, general summary:

    Mean read length: 4,722.5
    Mean read quality: 4.4
    Median read length: 906.0
    Median read quality: 4.2
    Number of reads: 1,000,000.0
    Read length N50: 14,404.0
    Total bases: 4,722,479,679.0

    Number, percentage and megabases of reads above quality cutoffs:
    Q5: 367454 (36.7%) 3015.3Mb
    Q7: 8 (0.0%) 0.1Mb
    Q10: 0 (0.0%) 0.0Mb
    Q12: 0 (0.0%) 0.0Mb
    Q15: 0 (0.0%) 0.0Mb

    Top 5 highest mean basecall quality scores and their read lengths:
    1: 7.0 (17272)  2: 7.0 (9848)  3: 7.0 (25242)  4: 7.0 (12091)  5: 7.0 (25093)

    Top 5 longest reads and their mean basecall quality score:
    1: 2210466 (3.6)  2: 1850945 (3.8)  3: 1772717 (3.6)  4: 1685671 (3.9)  5: 1563326 (3.9)

  • Second 1 million reads, general summary:

    Mean read length: 13,668.0
    Mean read quality: 11.1
    Median read length: 13,451.0
    Median read quality: 11.8
    Number of reads: 1,000,000.0
    Read length N50: 16,657.0
    Total bases: 13,668,019,254.0

    Number, percentage and megabases of reads above quality cutoffs:
    Q5: 963153 (96.3%) 13574.0Mb
    Q7: 937982 (93.8%) 13387.4Mb
    Q10: 781757 (78.2%) 10950.3Mb
    Q12: 446035 (44.6%) 6333.8Mb
    Q15: 165 (0.0%) 1.6Mb

    Top 5 highest mean basecall quality scores and their read lengths:
    1: 16.3 (2090)  2: 16.2 (243)  3: 16.1 (362)  4: 16.1 (570)  5: 16.1 (1509)

    Top 5 longest reads and their mean basecall quality score:
    1: 884004 (3.7)  2: 274368 (5.2)  3: 187850 (4.8)  4: 150969 (3.8)  5: 124444 (9.8)

So the mean read length differs markedly: about 13.7 kb vs 4.7 kb.
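For what it's worth, a back-of-the-envelope calculation from the NanoPlot totals above (4,722,479,679 bases in 5 h 23 min vs 13,668,019,254 bases in 24 h 10 min) suggests the second batch was slower not only because it contains roughly 2.9× more bases:

```shell
# Per-batch alignment throughput, computed from the totals reported above.
awk 'BEGIN {
  printf "batch1: %.2f Gb/h\n",  4722479679 / 1e9 / (5  + 23/60)   # -> batch1: 0.88 Gb/h
  printf "batch2: %.2f Gb/h\n", 13668019254 / 1e9 / (24 + 10/60)   # -> batch2: 0.57 Gb/h
}'
```

Per-base throughput also fell by roughly a third, consistent with Fritz's point that longer reads are disproportionately expensive to align.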

If we still want to use NGMLR on these data, is there any option that can speed the process up?

Best, Sarah

sdjebali avatar Oct 16 '19 09:10 sdjebali