HTStream icon indicating copy to clipboard operation
HTStream copied to clipboard

Order of input files to hts_SeqScreener changes hits reported when R1/R2 lengths differ

Open AstrobioMike opened this issue 1 year ago • 0 comments

Describe the bug Heya, thanks for maintaining HTStream :)

For the attached example files, the number of hits identified by hts_SeqScreener vary depending on whether the forward reads or reverse reads were provided first.

Overall summary:

how-run num_hits
R1 SE: -U test-R1.fq.gz 7
R2 SE: -U test-R2.fq.gz 2021
R1,R2 PE: -1 test-R1.fq.gz -2 test-R2.fq.gz 3348
R2,R1 PE: -1 test-R2.fq.gz -2 test-R1.fq.gz 2021
  • for these data, it is expected that R1 would yield near zero, as they are not biological
  • R2 are the biological sequences, and it seems providing things as -1 R2 -2 R1 to hts_SeqScreener returns the expected results based on what it looks like when providing just R2

For the data I was working with when running into this, attached, R1 length is 29 bases and R2 is 121. So the disparity may have to deal with the length of read 1 and if that's used for calculating the percentage-hit cutoff without consideration for read 2 length/total fragment length?

To Reproduce Example sequence reads that will help reproduce the bug:

ref.fa.gz test-R1.fq.gz test-R2.fq.gz

Commands to reproduce the behavior:

mamba create -y -n hts -c conda-forge -c bioconda -c defaults htstream=1.3.3

conda activate hts

### with files attached to issue

## running single
# R1
hts_SeqScreener -U test-R1.fq.gz -s ref.fa.gz -L hts-R1-SE.log > /dev/null
grep -A 5 "Single_end" hts-R1-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 7

# R2
hts_SeqScreener -U test-R2.fq.gz -s ref.fa.gz -L hts-R2-SE.log > /dev/null
grep -A 5 "Single_end" hts-R2-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 2021

## running together
# R1 first
hts_SeqScreener -1 test-R1.fq.gz -2 test-R2.fq.gz -s ref.fa.gz -C -L hts-R1-first.log > /dev/null
grep -A 3 "Paired_end" hts-R1-first.log | tail -n 1
    # 3348 hits

# R2 first
hts_SeqScreener -1 test-R2.fq.gz -2 test-R1.fq.gz -s ref.fa.gz -C -L hts-R2-first.log > /dev/null
grep -A 3 "Paired_end" hts-R2-first.log | tail -n 1
    # 2021 hits

Expected behavior I would have expected the results to be the same whether R1 or R2 were provided first when given as paired with the -C option set. But the order provided makes a large difference.

Desktop (please complete the following information):

  • observed on MacOS and Ubuntu

AstrobioMike avatar Jun 06 '23 21:06 AstrobioMike