HTStream
HTStream copied to clipboard
Order of input files to hts_SeqScreener changes hits reported when R1/R2 lengths differ
Describe the bug Heya, thanks for maintaining HTStream :)
For the attached example files, the number of hits identified by hts_SeqScreener vary depending on whether the forward reads or reverse reads were provided first.
Overall summary:
how-run | num_hits |
---|---|
R1 SE: -U test-R1.fq.gz |
7 |
R2 SE: -U test-R2.fq.gz |
2021 |
R1,R2 PE: -1 test-R1.fq.gz -2 test-R2.fq.gz |
3348 |
R2,R1 PE: -1 test-R2.fq.gz -2 test-R1.fq.gz |
2021 |
- for these data, it is expected that R1 would yield near zero, as they are not biological
- R2 are the biological sequences, and it seems providing things as -1 R2 -2 R1 to hts_SeqScreener returns the expected results based on what it looks like when providing just R2
For the data I was working with when running into this, attached, R1 length is 29 bases and R2 is 121. So the disparity may have to deal with the length of read 1 and if that's used for calculating the percentage-hit cutoff without consideration for read 2 length/total fragment length?
To Reproduce Example sequence reads that will help reproduce the bug:
ref.fa.gz test-R1.fq.gz test-R2.fq.gz
Commands to reproduce the behavior:
mamba create -y -n hts -c conda-forge -c bioconda -c defaults htstream=1.3.3
conda activate hts
### with files attached to issue
## running single
# R1
hts_SeqScreener -U test-R1.fq.gz -s ref.fa.gz -L hts-R1-SE.log > /dev/null
grep -A 5 "Single_end" hts-R1-SE.log | tail -n 1 | sed 's/^/# /'
# "hits": 7
# R2
hts_SeqScreener -U test-R2.fq.gz -s ref.fa.gz -L hts-R2-SE.log > /dev/null
grep -A 5 "Single_end" hts-R2-SE.log | tail -n 1 | sed 's/^/# /'
# "hits": 2021
## running together
# R1 first
hts_SeqScreener -1 test-R1.fq.gz -2 test-R2.fq.gz -s ref.fa.gz -C -L hts-R1-first.log > /dev/null
grep -A 3 "Paired_end" hts-R1-first.log | tail -n 1
# 3348 hits
# R2 first
hts_SeqScreener -1 test-R2.fq.gz -2 test-R1.fq.gz -s ref.fa.gz -C -L hts-R2-first.log > /dev/null
grep -A 3 "Paired_end" hts-R2-first.log | tail -n 1
# 2021 hits
Expected behavior
I would have expected the results to be the same whether R1 or R2 were provided first when given as paired with the -C
option set. But the order provided makes a large difference.
Desktop (please complete the following information):
- observed on MacOS and Ubuntu