shorah
shorah copied to clipboard
Error running Amplicon with diversity option
Shorah Version:
1.99.2
Command to reproduce:
$ shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
I have trimmed my read length to equal length prior to this step.
Expected Behavior:
The analysis outputs the entropy.pdf
and entropy.csv
besides everything else when running without the -d
option.
Actual Behavior:
The entropy.pdf
and entropy.csv
are generated, but the software terminates with error:
[mpileup] 1 samples in 1 input files
/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py:255: UserWarning: mpileup column not fully parsed
warnings.warn('mpileup column not fully parsed')
Traceback (most recent call last):
File "/opt/conda/envs/python3/bin/shorah", line 14, in <module>
main()
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 210, in main
args.func(args)
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 89, in amplicon_run
amplicon.main(args)
File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py", line 387, in main
h = list(open('coverage.txt'))[0]
IndexError: list index out of range
Output from shorah.log
INFO 2021/05/24 13:36:25 cli.py: main() 204: shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
INFO 2021/05/24 13:36:25 cli.py: main() 205: shorah version:1.99.2
INFO 2021/05/24 13:36:25 amplicon.py: main() 352: shorah amplicon -b my_sorted_bam.bam -f amplicon.fasta -d
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: samtools view my_sorted_bam.bam | cut -f 10 > rl.txt
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 229: n_reads: 500
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 235: trimmed_mean: 388
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: samtools mpileup -f amplicon.fasta -d 10000 my_sorted_bam.bam > sample.mpu
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 272: start: 1, stop: 388
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 286: highest entropy found at position 22
.
.
.
INFO 2021/05/24 13:36:25 amplicon.py: main() 372: analysing region from -1 to 387
I have traced back in the code and it turns out that the region index return by function highest_entropy()
in src/shorah/amplicon.py
does not match my expectation. When all my reads have the same length, it will skip the steps finding the max entropy (https://github.com/cbg-ethz/shorah/blob/0334e4d509dc93152945a80d5b1a6dbd4af07e61/src/shorah/amplicon.py#L300 to line 304) and returns -1 for the start index of region. While from my understanding towards the code, the finding entropy part should at least be iterated one time. I found that's because the value of rsto
is lower that rsta
in this case. So I tried to fix this issue by changing line 298 from
rsto = min(stop - trimmed_mean, highest_ent_pos + 1)
to
rsto = min(stop - trimmed_mean + 2, highest_ent_pos + 1)
Also, in line 305, the high_ent_stop
should ends where the region ends as the return value to match the the logic invoked the function, so I subtract 1 from high_ent_start + trimmed_mean
, from
high_ent_stop = high_ent_start + trimmed_mean
to
high_ent_stop = high_ent_start + trimmed_mean -1
By applying these two changes, I am able to run the amplicon function on my data with equal length with the -d
option, as the region is now fixed for this rare margin case. I don't know if I understand this part right and make the correct modification. Would you verify it for me to see if this is a bug?
I'm running into the same error, also on v1.99.2, but even in shorah amplicon
mode with default arguments:
$ shorah amplicon -b SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f NC_045512.2.fas
Traceback (most recent call last):
File "/usr/local/bin/shorah", line 14, in <module>
main()
File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 210, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 89, in amplicon_run
amplicon.main(args)
File "/usr/local/lib/python3.6/site-packages/shorah/amplicon.py", line 387, in main
h = list(open('coverage.txt'))[0]
IndexError: list index out of range
So I'm not sure that this error is unique to the -d
option, but rather, it seems general to shorah amplicon
. Here's the contents of my log file:
INFO 2021/08/10 14:19:35 cli.py: main() 204: /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 cli.py: main() 205: shorah version:1.99.2
INFO 2021/08/10 14:19:35 amplicon.py: main() 352: /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 amplicon.py: main() 372: analysing region from 1 to 29903
DEBUG 2021/08/10 14:19:35 amplicon.py: run_child() 90: /usr/local/bin/b2w -i 0 -w 29903 -m 28407 -x 10000 -c 0 ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam ../NC_045512.2.fas
DEBUG 2021/08/10 14:19:40 amplicon.py: run_child() 99: Child /usr/local/bin/b2w returned 0
DEBUG 2021/08/10 14:19:40 amplicon.py: main() 380: b2w returned 0
In addition to the log file, shorah amplicon
created the following files, all of which are 0 bytes:
-
coverage.txt
-
reads.fas
-
w-NC_045512.2-1-29903.reads.fas
The edits proposed by XU-Nuo seem to fix the issue for -d
mode on my end as well, but not in the default parameters
Thank you for reporting. I hope I'll have time to check this next week...
Same issue here. I wonder if the problem might be the arguments sent to b2w.
Lines 377-378 in amplicon.py:
b2w_args = ' -i 0 -w %d -m %d -x %d -c %i%s %s %s %s' % (ref_length, int(
min_overlap * ref_length), max_coverage, cov_thrd, d, in_bam, in_fasta, region)
I'm not sure I understand what b2w does exactly, but this parameter combination of window length equal to reference length and minimum overlap equal to 95% of that always creates empty output files (coverage.txt
and the *reads.fas
files) for me.