shorah icon indicating copy to clipboard operation
shorah copied to clipboard

Error running Amplicon with diversity option

Open XU-Nuo opened this issue 3 years ago • 3 comments

Shorah Version:

1.99.2

Command to reproduce:

$ shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d

I have trimmed my read length to equal length prior to this step.

Expected Behavior:

The analysis outputs the entropy.pdf and entropy.csv besides everything else when running without the -d option.

Actual Behavior:

The entropy.pdf and entropy.csv are generated, but the software terminates with error:

[mpileup] 1 samples in 1 input files
/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py:255: UserWarning: mpileup column not fully parsed
  warnings.warn('mpileup column not fully parsed')
Traceback (most recent call last):
  File "/opt/conda/envs/python3/bin/shorah", line 14, in <module>
    main()
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 210, in main
    args.func(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 89, in amplicon_run
    amplicon.main(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py", line 387, in main
    h = list(open('coverage.txt'))[0]
IndexError: list index out of range

Output from shorah.log

INFO 2021/05/24 13:36:25 cli.py: main() 204: 	shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
INFO 2021/05/24 13:36:25 cli.py: main() 205: 	shorah version:1.99.2

INFO 2021/05/24 13:36:25 amplicon.py: main() 352: 	shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: 	samtools view my_sorted_bam.bam | cut -f 10 > rl.txt
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: 	Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 229: 	n_reads: 500
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 235: 	trimmed_mean: 388
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: 	samtools mpileup  -f amplicon.fasta -d 10000 my_sorted_bam.bam > sample.mpu
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: 	Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 272: 	start: 1, stop: 388
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 286: 	highest entropy found at position 22
.
.
.
INFO 2021/05/24 13:36:25 amplicon.py: main() 372: 	analysing region from -1 to 387

I have traced back in the code and it turns out that the region index return by function highest_entropy() in src/shorah/amplicon.pydoes not match my expectation. When all my reads have the same length, it will skip the steps finding the max entropy (https://github.com/cbg-ethz/shorah/blob/0334e4d509dc93152945a80d5b1a6dbd4af07e61/src/shorah/amplicon.py#L300 to line 304) and returns -1 for the start index of region. While from my understanding towards the code, the finding entropy part should at least be iterated one time. I found that's because the value of rsto is lower that rsta in this case. So I tried to fix this issue by changing line 298 from

rsto = min(stop - trimmed_mean, highest_ent_pos + 1)

to

rsto = min(stop - trimmed_mean + 2, highest_ent_pos + 1)

Also, in line 305, the high_ent_stop should ends where the region ends as the return value to match the the logic invoked the function, so I subtract 1 from high_ent_start + trimmed_mean, from

high_ent_stop = high_ent_start + trimmed_mean

to

high_ent_stop = high_ent_start + trimmed_mean -1

By applying these two changes, I am able to run the amplicon function on my data with equal length with the -d option, as the region is now fixed for this rare margin case. I don't know if I understand this part right and make the correct modification. Would you verify it for me to see if this is a bug?

XU-Nuo avatar May 24 '21 09:05 XU-Nuo