Subsetting input files results in a higher number of aligned reads
We recently observed that when we split our input FASTQ file into subsets, we get more aligned reads than when running the complete file in a single pass.
The input file contains 594,589 reads.
Command used to run the complete input file:
diamond blastx -d database.dmnd -q input_file.fastq -p 16 --mid-sensitive -o output_file.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
Command used to subset and run Diamond on the subsets:
split -l 4000 input_file.fastq split_fastq
for f in split_fastq*; do
diamond blastx -d database.dmnd -q $f -p 16 --mid-sensitive -o ${f}.dmnd.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
done
cat split_fastq*.dmnd.txt > output_file.txt
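As a sanity check that `split -l 4000` keeps FASTQ records intact (each read spans 4 lines, so 4000 lines is exactly 1,000 whole reads per chunk), here is a small sketch using synthetic data; the filenames are illustrative only:

```shell
# Build a tiny synthetic FASTQ with 10 reads (4 lines per read)
for i in $(seq 1 10); do
  printf "@read%s\nACGT\n+\nIIII\n" "$i"
done > demo.fastq

# Split on a multiple of 4 lines (here 8 lines = 2 reads per chunk),
# mirroring how `split -l 4000` yields 1,000 reads per chunk
split -l 8 demo.fastq demo_split_

# Every chunk should contain a whole number of 4-line records
for f in demo_split_*; do
  lines=$(wc -l < "$f")
  echo "$f: $((lines / 4)) reads"
done
```

If the `-l` argument were not a multiple of 4, records would be cut across chunk boundaries and Diamond would see malformed input.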
Running the complete input file yielded 250,621 aligned reads; subsets of 100,000 reads each yielded 368,446 aligned reads; and subsets of 1,000 reads each yielded 490,287 aligned reads.
Is this normal behavior, and how can we get the output for all reads that aligned with the subsets when running the complete dataset?
One reason could be the query-indexed mode, which is automatically used for small query files but does not have equivalent sensitivity to the default mode. You can explicitly control this by setting --algo 0 or --algo 1. If you need more sensitivity, a better way to achieve this is probably to run the whole file with --sensitive etc. If these are long reads, you should also consider frameshift alignment mode.
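Putting that advice together with the original command, forcing the query-indexed mode on the complete file might look like the following (a sketch only; all other flags are carried over unchanged from the command above, and the flag combination should be checked against your Diamond version):

```
diamond blastx -d database.dmnd -q input_file.fastq -p 16 --sensitive --algo 1 \
    -o output_file.txt -f 6 -k 1 -b 10 -c 1 -t /tmp
```

Note that --algo 1 is only compatible with certain sensitivity modes, which is why --mid-sensitive may need to be swapped for --sensitive here.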
I have added the --algo 1 argument. To make this work, I needed to change --mid-sensitive to --sensitive. This resulted in a higher number of aligned reads. Thanks for the help.