Circle-Map
Circle-Map copied to clipboard
An error happenend during execution. Exiting
Hi Inigo,
I have multiple samples where Circle-Map will fail with:
"An error happenend during execution. Exiting"
stderr:
[E::idx_find_and_load] Could not retrieve index file for 'output/circle-map/namesort_bams/tissue1.namesort.bam'
0%| | 0/800 [00:00<?, ?it/s]
0it [00:00, ?it/s][A
0%| | 1/800 [00:14<3:17:10, 14.81s/it]
0%| | 1/800 [00:14<3:18:24, 14.90s/it]
0it [00:14, ?it/s]
stdout:
2020-10-13 14:06:35: Realigning reads using Circle-Map
2020-10-13 14:06:35: Clustering structural variant reads
2020-10-13 14:25:21: Splitting clusters to to processors
2020-10-13 14:25:38: An error happenend during execution. Exiting
-
Perhaps this is shared with Issue #35 ? However, the speed isn't my issue, but rather the early exit / run failure. One of my samples exits very quickly, so maybe I'll be able to debug with that sample as my test case. I'll let you know if I have any luck.
-
When this error occurs, could the call to sys.exit() be modified to return exit code 1? Or any non-zero value? With the default return status of 0, wrapper job scripts around Circle-Map will not be able to detect that an error has occurred.
I ran into the same problem of early exiting.
Hi Inigo,
I've done a bit of digging into my failing samples.
When I re-run Circlemap-Realign interactively, I get additional warning messages that were not captured in the stdout / stderr when the same commands were submitted as a batch job. These warnings are similar to those encountered in #46 .
[E::idx_find_and_load] Could not retrieve index file for 'output/circle-map/namesort_bams/tissue1.namesort.bam'
0%| Traceback (most recent call last): | 0/800 [00:00<?, ?it/s]
File "/home/michael/.snakemake/conda/68e54810/lib/python3.8/site-packages/circlemap/realigner.py", line 305, in realign
prob = realignment_probability(realignment_dict,interval_length)
File "/home/michael/.snakemake/conda/68e54810/lib/python3.8/site-packages/circlemap/utils.py", line 1111, in realignment_probability
posterior = 2**best_hit/(np.sum((2**value[2]) for key,value in hit_dict['alignments'].items()))
ZeroDivisionError: float division by zero
/home/michael/.snakemake/conda/68e54810/lib/python3.8/site-packages/circlemap/realigner.py:389: UserWarning: Failed on interval chrom chr3
start 93470357
end 93470657
Name: 0, dtype: object due to the error float division by zero
warnings.warn(
0%| | 1/800 [00:11<2:38:47, 11.92s/it]
2020-10-20 20:58:02: An error happenend during execution. Exiting
0it [00:11, ?it/s]
For context, this region of hg38 chr3:93470357-93470657 is:
- Very high coverage, spanning repetitive sequence
- Spans the right edge of pericentromeric NNNNs in the chr3 sequence.
The error indicates that utils.py, line 1111: realignment_probability divides with a denominator of zero.
I'm not sure how the denominator of
posterior = 2**best_hit/(np.sum((2**value[2]) for key,value in hit_dict['alignments'].items()))
evaluates to zero.
If realignment_dict does not exist, this should be caught by the preceding realignment_dict == None
.
Is it possible to have large negative values (e.g. value[2] = -10000)?
If so, this could lead to:
>>> 2**-10000
0.0
As realignment_probability() is only used to trigger if-else conditions for read re-alignment (bam2bam.py, line 263; realigner.py, line 305), could a try-except be added such that if divide-by-zero occurs, realignment_probability() simply returns zero? This would result in a 'pass' for any failing reads, allowing Circle-Map Realign to continue.
I've edited utils.py such that:
def realignment_probability(hit_dict,interval_length):
"""Function that takes as input the realignment dictionary and returns the alignment probability of the best hit"""
best_hit = hit_dict['alignments'][1][2]
#this might be included on the denominator
try:
posterior = 2**best_hit/(np.sum((2**value[2]) for key,value in hit_dict['alignments'].items()))
except ZeroDivisionError:
posterior = 0
return(posterior)
And now my sample is running beyond the point where it previously failed.
Hi Michael,
This sounds great. I have been on vacation and I will start to look at the pile of issues tomorrow. By the way, I really appreciate your effort!
Hi Inigo,
I hope you enjoyed your vacation!
I believe I have found another source of ZeroDivisionError
that leads to this error.
I mention the ZeroDivisionError
in realignment_probability()
above.
I also encounter ZeroDivisionError
in pssm()
in a small subset of samples in regions where a 300 bp window in the genome is lacking certain nucleotide.
Error message:
realigner.py, line 295 realignment_dict = realign(read,self.n_hits,plus_coding_interval,minus_coding_interval,
plus_base_freqs,minus_base_freqs,self.gap_open,self.gap_ext,self.verbose,edits_allowed)
Points to:
utils.py, line 962, within realign()score = pssm(phred_to_prob(np.array(soft_clipped_read['qual'],dtype=np.float64)), encoded_nucs, edlib_cigar_to_iterable(alignment['cigar']), plus_base_freqs,gap_open,gap_extend,nuc_and_ops,verbose)
And finally:
ZeroDivisionError: division by zero
The pssm()
function in utils divides by base frequencies.
This is the reference sequence of the interval that was being analysed when the run failed:
>chrUn_KN707709v1_decoy:1-300
CACACGCACACAGAGACACACGCACACAGAGACACACGCACACAGAGACACACGCACACA
GAGACACGCAGAGAGAGACACGCACAGAGAGACACGCACAGAGAGACACGCACACAGAGA
GACACGCACACAGAGAGACACGCACACAGAGAGACACGCACACAGAGAGACACGCACACA
GAGAGACACGCACACACGCACAGAGAGACACACAGACACAGAGACAGAGAGAGACAGAGA
GAGAGACACAGAGAGACAGAGAGAGAGACACAGAGACAGAGACACAGAGACAGAGAGAGA
In another sample that fails, the sequence is:
>chrUn_KN707709v1_decoy:1200-1500
AGAGAGAGAGACACACACGGAGAGAGAGAGAGACACACACGGAGAGAGAGAGAGAGACAC
ACGGAGAGAGAGAGAGAGAGACACGGAGAGAGAGAGAGAGACACGGAGAGAGAGAGACAC
ACGGAGAGAGAGAGAGAGACACACGGAGAGAGAGAGACACACGGAGAGAGAGAGACACAC
GGAGAGAGAGAGACACACGGAGAGAGAGAGAGAGACACGGAGAGAGAGAGAGAGACACAC
GGAGAGAGAGAGAGAGACACGGAGAGAGAGAGAGAGAGACACACGGAGAGAGAGAGAGAG
My bet is the lack of T causes base_freq of T to be 0, and division by zero. Can additional try-except be added to account for this possibility?
Or, if it doesn't break the base frequency math / logic, could a minimum base_freq be set to a low non-zero value, simply to avoid the ZeroDivisionError
?
Something like:
base_freq = max(base_freq, 0.001)
Hi Inigo,
Regarding the second source of ZeroDivisionError:
The function background_freqs() is indirectly used to calculate a division denominator within pssm().
My fix was to edit utils.py such that:
def background_freqs(seq):
"""Function that takes as input the sequence of the nucletide frequencies in the realignment interval"""
# return{nucleotide: seq.count(nucleotide)/len(seq) for nucleotide in 'ATCG'}
return{nucleotide: max(1,seq.count(nucleotide))/len(seq) for nucleotide in 'ATCG'}
This forces the value to be at least 1, which still results in a very low base_freq within pssm() but at least it is non-zero.
The samples that previously failed are now succeeding.
Dear Jon,
sorry for the late reply. I am incredibly busy.
First of all, thanks a lot for taking care of this issues. It would have take me a lot of time to figure out what was going on. Would you up to put this into a pull-request?
On another note: Could you contact me by email? I tried looking for your mail but did not manage to find you.
Best,
Inigo
I was already considering making the pull request, but I hadn't found the time. I'll see if I can get to it soon!
And yes, just e-mailed you separately.
Cheers, Michael
I just pushed a patch that fixes the ZeroDivisionError of the realignment probability.
Thanks a lot for your help John. I am currently testing it and it seems to work. If it works, I will submit it to bioconda and do a proper release
Hi Inigo, I'm still running into the same error, do you mind telling me how you fixed the error? Thanks.
Best, Robin