
Add BBTools Java implementation for fqcnt benchmark

Open · bbushnell opened this issue 1 month ago · 1 comment

This PR adds BBTools FastqScan as a Java implementation for the fqcnt benchmark.

Implementation Details

  • Uses BBTools' FastqScan tool with multithreaded SIMD-accelerated parsing
  • Wrapper script: fqcnt_java_bbtools.sh
  • Output format matches biofast specification: <records>\t<bases>\t<qualities>
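For readers unfamiliar with the fqcnt task, the output contract above can be illustrated with a minimal single-threaded sketch. This is not FastqScan's implementation (which is multithreaded and SIMD-accelerated); the class name `FqCountSketch` is hypothetical, and the code assumes strict 4-line FASTQ records with no wrapped sequence lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the fqcnt output contract; NOT FastqScan itself.
class FqCountSketch {

    /** Returns {records, bases, qualities}, assuming strict 4-line FASTQ records. */
    public static long[] count(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        long records = 0, bases = 0, quals = 0;
        String header;
        while ((header = br.readLine()) != null) {
            if (header.isEmpty()) continue;   // tolerate trailing blank lines
            String seq = br.readLine();       // sequence line
            String plus = br.readLine();      // '+' separator line
            String qual = br.readLine();      // quality line
            if (seq == null || plus == null || qual == null) {
                throw new IOException("truncated FASTQ record after: " + header);
            }
            records++;
            bases += seq.length();
            quals += qual.length();
        }
        return new long[] {records, bases, quals};
    }

    /** Formats counts as <records>\t<bases>\t<qualities>. */
    public static String format(long[] c) {
        return c[0] + "\t" + c[1] + "\t" + c[2];
    }

    public static void main(String[] args) throws IOException {
        // Reads FASTQ text from stdin and prints one tab-separated line.
        System.out.println(format(count(
                new InputStreamReader(System.in, StandardCharsets.US_ASCII))));
    }
}
```

A line-oriented reader like this is useful as a correctness reference when comparing counts, not as a performance baseline.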

Testing

Tested with M_abscessus_HiSeq.fq (5,682,010 reads):

5682010	568201000	568201000

Requirements

  • Java 18+ required (for jdk.incubator.vector SIMD support)
  • Java 25 recommended for optimal performance
  • BBTools: git clone --depth=1 https://github.com/bbushnell/BBTools
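A quick way to confirm that a given runtime meets these requirements is a small preflight check. The sketch below is only an illustration (the class name `JavaVersionCheck` is hypothetical and not part of BBTools); it uses the standard `Runtime.version()` and `ModuleFinder.ofSystem()` APIs to report the feature version and whether the `jdk.incubator.vector` module ships with the runtime image.

```java
import java.lang.module.ModuleFinder;

// Hypothetical preflight check; not part of BBTools or the wrapper script.
class JavaVersionCheck {
    static final int MIN_FEATURE = 18; // minimum stated in the requirements above

    /** True when the given feature version satisfies the stated minimum. */
    public static boolean meetsMinimum(int feature) {
        return feature >= MIN_FEATURE;
    }

    /** True when the runtime image contains the incubating Vector API module. */
    public static boolean vectorModuleInImage() {
        return ModuleFinder.ofSystem().find("jdk.incubator.vector").isPresent();
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        System.out.println("Java feature version: " + feature);
        System.out.println("Meets minimum (" + MIN_FEATURE + "+): " + meetsMinimum(feature));
        System.out.println("jdk.incubator.vector in runtime image: " + vectorModuleInImage());
    }
}
```

Incubator modules are not resolved by default, so a program that uses the Vector API generally also needs `--add-modules jdk.incubator.vector` on the `java` command line.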

About BBTools

BBTools is a comprehensive suite of bioinformatics tools developed at the Joint Genome Institute (JGI). FastqScan provides high-performance FASTQ parsing optimized for modern hardware.

Repository: https://github.com/bbushnell/BBTools

bbushnell, Dec 11 '25 22:12

Performance Note: FastqScan is fastest with larger files and BGZF compression

JVM Startup Overhead

The JVM adds roughly 0.25 s of startup and JIT-compilation overhead, which dominates benchmark timings on small files such as the 5.68M-read test case. This overhead is:

  • Amortized on production-scale files (100M+ reads)
  • Irrelevant when called from Java code (JVM already running)
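The amortization argument can be made concrete with a small model: wall time = fixed startup + bytes / steady-state rate. The sketch below is an assumption-laden illustration, not a measurement. It takes the ~0.25 s figure from this note and the ~4.2 GB/s plaintext figure quoted further down as a steady-state rate (assumed here to exclude startup), and the two file sizes are hypothetical.

```java
// Hypothetical model of fixed startup cost vs. steady-state throughput.
class StartupOverheadSketch {

    /** Fraction of wall-clock time spent in the fixed startup overhead. */
    public static double startupShare(double startupSec, double bytes, double bytesPerSec) {
        double workSec = bytes / bytesPerSec;
        return startupSec / (startupSec + workSec);
    }

    /** Observed end-to-end throughput once startup is included. */
    public static double effectiveRate(double startupSec, double bytes, double bytesPerSec) {
        return bytes / (startupSec + bytes / bytesPerSec);
    }

    public static void main(String[] args) {
        double startup = 0.25;          // seconds, from the note above
        double rate = 4.2e9;            // bytes/s, assumed steady-state plaintext rate
        double[] sizes = {1e9, 30e9};   // hypothetical file sizes in bytes
        for (double size : sizes) {
            System.out.printf("size=%.0f GB  startup share=%.1f%%  effective=%.2f GB/s%n",
                    size / 1e9,
                    100 * startupShare(startup, size, rate),
                    effectiveRate(startup, size, rate) / 1e9);
        }
    }
}
```

Under this model the startup share shrinks toward zero as the file grows, which is why the overhead matters mainly for small benchmark inputs.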

BGZF Multithreaded Decompression

FastqScan is faster on BGZF-compressed files than on plaintext because decompression runs in parallel, provided enough cores are available (~20). On 80M reads:

  • Plaintext: ~4.2 GB/s (single-threaded)
  • BGZF compressed: ~6.2 GB/s (multithreaded decompression)
  • FastqScanMT (with t=2): ~9.5 GB/s on BGZF
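The reason BGZF lends itself to parallel decompression is that the file is a series of independent gzip members, each carrying its own compressed size (BSIZE) in a `BC` extra subfield. The following is a minimal sketch of that idea using only `java.util.zip`; it is not FastqScan's implementation. The class `BgzfParallelSketch` and its helpers are hypothetical, the whole file is assumed to fit in memory, and CRC32 values are not verified.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of block-parallel BGZF inflation; NOT FastqScan's code.
class BgzfParallelSketch {
    /** Location of one block's raw-deflate payload and its uncompressed size. */
    record Block(int dataOff, int dataLen, int isize) {}

    static int u16(byte[] b, int i) { return (b[i] & 0xFF) | ((b[i + 1] & 0xFF) << 8); }
    static int i32(byte[] b, int i) { return u16(b, i) | (u16(b, i + 2) << 16); }

    /** Walks block headers only (cheap, sequential) to find every block boundary. */
    public static List<Block> index(byte[] f) {
        List<Block> blocks = new ArrayList<>();
        int pos = 0;
        while (pos < f.length) {
            if (pos + 18 > f.length || (f[pos] & 0xFF) != 31 || (f[pos + 1] & 0xFF) != 139)
                throw new IllegalArgumentException("not a BGZF block at offset " + pos);
            int xlen = u16(f, pos + 10);
            int bsize = -1;
            for (int p = pos + 12, end = p + xlen; p + 4 <= end; ) {
                int slen = u16(f, p + 2);
                if (f[p] == 66 && f[p + 1] == 67 && slen == 2) bsize = u16(f, p + 4); // 'B','C'
                p += 4 + slen;
            }
            int total = bsize + 1; // BSIZE stores total block size minus 1
            if (bsize < 0 || pos + total > f.length)
                throw new IllegalArgumentException("bad or truncated block at offset " + pos);
            blocks.add(new Block(pos + 12 + xlen, bsize - xlen - 19, i32(f, pos + total - 4)));
            pos += total;
        }
        return blocks;
    }

    /** Inflates a single block; blocks share no state, so this is safe to run concurrently. */
    public static byte[] inflate(byte[] f, Block b) {
        Inflater inf = new Inflater(true); // raw deflate stream, no gzip wrapper
        try {
            inf.setInput(f, b.dataOff(), b.dataLen());
            byte[] out = new byte[b.isize()];
            int n = 0;
            while (n < out.length && !inf.finished()) {
                int k = inf.inflate(out, n, out.length - n);
                if (k == 0 && (inf.needsInput() || inf.needsDictionary())) break;
                n += k;
            }
            if (n != out.length) throw new IllegalStateException("short block");
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
    }

    /** Inflates all blocks in parallel; results stay in file order. */
    public static byte[][] inflateAll(byte[] f) {
        return index(f).parallelStream().map(b -> inflate(f, b)).toArray(byte[][]::new);
    }

    /** Demo helper: wraps a small payload (well under 64 KiB) in one BGZF block. */
    public static byte[] block(byte[] payload) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(payload);
        def.finish();
        byte[] buf = new byte[payload.length + 64];
        int clen = 0;
        while (!def.finished()) clen += def.deflate(buf, clen, buf.length - clen);
        def.end();
        CRC32 crc = new CRC32();
        crc.update(payload);
        int total = 18 + clen + 8;
        ByteBuffer bb = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        bb.put((byte) 31).put((byte) 139).put((byte) 8).put((byte) 4); // ID1, ID2, CM, FLG
        bb.putInt(0).put((byte) 0).put((byte) 255).putShort((short) 6); // MTIME, XFL, OS, XLEN
        bb.put((byte) 66).put((byte) 67).putShort((short) 2).putShort((short) (total - 1));
        bb.put(buf, 0, clen).putInt((int) crc.getValue()).putInt(payload.length);
        return bb.array();
    }
}
```

Indexing reads only headers, so the sequential part stays small and the inflate calls, which dominate the cost, can be spread across cores.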

FastqScan Performance Chart

[Figure: FastqScan_NeedleTail. Performance comparison showing FastqScanMT at 13.5x faster than Rust needletail on BGZF files.]

bbushnell, Dec 11 '25 22:12