Subsetting input files results in a higher number of aligned reads
We recently observed that when we split our input FASTQ file into subsets, we get more aligned reads than when running the complete file in a single pass.
The input file contains 594,589 reads.
Command used to run the complete input file:
diamond blastx -d database.dmnd -q input_file.fastq -p 16 --mid-sensitive -o output_file.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
Command used to subset and run Diamond on the subsets:
split -l 4000 input_file.fastq split_fastq
for f in split_fastq*; do
diamond blastx -d database.dmnd -q $f -p 16 --mid-sensitive -o ${f}.dmnd.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
done
cat split_fastq*.dmnd.txt > output_file.txt
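As a sanity check that `split -l 4000` keeps FASTQ records intact (each read spans 4 lines, so 4000 lines is exactly 1,000 whole reads per chunk), here is a small sketch using synthetic data; the filenames are illustrative only:

```shell
# Build a tiny synthetic FASTQ with 10 reads (4 lines per read)
for i in $(seq 1 10); do
  printf "@read%s\nACGT\n+\nIIII\n" "$i"
done > demo.fastq

# Split on a multiple of 4 lines (here 8 lines = 2 reads per chunk),
# mirroring how `split -l 4000` yields 1,000 reads per chunk
split -l 8 demo.fastq demo_split_

# Every chunk should contain a whole number of 4-line records
for f in demo_split_*; do
  lines=$(wc -l < "$f")
  echo "$f: $((lines / 4)) reads"
done
```

If the `-l` argument were not a multiple of 4, records would be cut across chunk boundaries and Diamond would see malformed input.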
Running the complete input file yielded 250,621 aligned reads; subsets of 100,000 reads each yielded 368,446 aligned reads; and subsets of 1,000 reads each yielded 490,287 aligned reads.
Is this normal behavior, and how can we get the output for all reads that aligned with the subsets when running the complete dataset?
One reason could be the query-indexed mode, which is automatically used for small query files but does not have equivalent sensitivity to the default mode. You can explicitly control this by setting --algo 0 or --algo 1. If you need more sensitivity, a better way to achieve this is probably to run the whole file with --sensitive etc. If these are long reads, you should also consider frameshift alignment mode.
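Putting that advice together with the original command, forcing the query-indexed mode on the complete file might look like the following (a sketch only; all other flags are carried over unchanged from the command above, and the flag combination should be checked against your Diamond version):

```
diamond blastx -d database.dmnd -q input_file.fastq -p 16 --sensitive --algo 1 \
    -o output_file.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
```

Note that --algo 1 is only compatible with certain sensitivity modes, which is why --mid-sensitive may need to be swapped for --sensitive here.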
I have added the --algo 1 argument. To make this work, I needed to change --mid-sensitive to --sensitive. This resulted in a higher number of aligned reads. Thanks for the help.