shorah
shorah copied to clipboard
Error running Amplicon with diversity option
Shorah Version:
1.99.2
Command to reproduce:
$ shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
I have trimmed my read length to equal length prior to this step.
Expected Behavior:
The analysis outputs the entropy.pdf
and entropy.csv
besides everything else when running without the -d
option.
Actual Behavior:
The entropy.pdf
and entropy.csv
are generated, but the software terminates with error:
[mpileup] 1 samples in 1 input files
/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py:255: UserWarning: mpileup column not fully parsed
warnings.warn('mpileup column not fully parsed')
Traceback (most recent call last):
File "/opt/conda/envs/python3/bin/shorah", line 14, in <module>
main()
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 210, in main
args.func(args)
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 89, in amplicon_run
amplicon.main(args)
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py", line 387, in main
h = list(open('coverage.txt'))[0]
IndexError: list index out of range
Output from shorah.log
INFO 2021/05/24 13:36:25 cli.py: main() 204: shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
INFO 2021/05/24 13:36:25 cli.py: main() 205: shorah version:1.99.2
INFO 2021/05/24 13:36:25 amplicon.py: main() 352: shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: samtools view my_sorted_bam.bam | cut -f 10 > rl.txt
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 229: n_reads: 500
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 235: trimmed_mean: 388
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: samtools mpileup -f amplicon.fasta -d 10000 my_sorted_bam.bam > sample.mpu
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 272: start: 1, stop: 388
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 286: highest entropy found at position 22
.
.
.
INFO 2021/05/24 13:36:25 amplicon.py: main() 372: analysing region from -1 to 387
I have traced back in the code and it turns out that the region index return by function highest_entropy()
in src/shorah/amplicon.py
does not match my expectation. When all my reads have the same length, it will skip the steps finding the max entropy (https://github.com/cbg-ethz/shorah/blob/0334e4d509dc93152945a80d5b1a6dbd4af07e61/src/shorah/amplicon.py#L300 to line 304) and returns -1 for the start index of region. While from my understanding towards the code, the finding entropy part should at least be iterated one time. I found that's because the value of rsto
is lower that rsta
in this case. So I tried to fix this issue by changing line 298 from
rsto = min(stop - trimmed_mean, highest_ent_pos + 1)
to
rsto = min(stop - trimmed_mean + 2, highest_ent_pos + 1)
Also, in line 305, the high_ent_stop
should ends where the region ends as the return value to match the the logic invoked the function, so I subtract 1 from high_ent_start + trimmed_mean
, from
high_ent_stop = high_ent_start + trimmed_mean
to
high_ent_stop = high_ent_start + trimmed_mean -1
By applying these two changes, I am able to run the amplicon function on my data with equal length with the -d
option, as the region is now fixed for this rare margin case. I don't know if I understand this part right and make the correct modification. Would you verify it for me to see if this is a bug?