UnicodeEncodeError: 'ascii' codec can't encode character
Hi,
I am running arcsv for complex SVs. The error is raised at line 22 of the function below, which I have marked with asterisks.
def sv_affected_len(path, blocks):
    # ref_path = list(range(0, 2 * len(blocks)))
    n_ref = len([x for x in blocks if not x.is_insertion()])
    ref_block_num = list(range(n_ref))
    ref_string = ''.join(chr(x) for x in range(ord('A'), ord('A') + n_ref))
    print('ref_string: {0}'.format(ref_string))
    path_block_num = []
    path_string = ''
    for i in path[1::2]:
        block_num = int(np.floor(i / 2))
        path_block_num.append(block_num)
        if i % 2 == 1:  # forward orientation
            path_string += chr(ord('A') + block_num)
        else:  # reverse orientation
            path_string += chr(ord('A') + block_num + 1000)
    **print('path_string: {0}'.format(path_string))**
    affected_idx_1, affected_idx_2 = align_strings(ref_string, path_string)
    affected_block_1 = set(ref_block_num[x] for x in affected_idx_1)
    affected_block_2 = set(path_block_num[x] for x in affected_idx_2)
    affected_blocks = affected_block_1.union(affected_block_2)
    affected_len = sum(len(blocks[i]) for i in affected_blocks)
    return affected_len
Please help, thanks!
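For context on why the starred print fails: reverse-orientation blocks are encoded as chr(ord('A') + block_num + 1000), which is always outside the ASCII range. The sketch below assumes the UnicodeEncodeError comes from stdout using the 'ascii' codec (for example under LANG=C); block_num = 3 and the helper ascii_encodable are illustrative and not part of arcsv.

```python
# Minimal reproduction sketch; block_num = 3 is an arbitrary illustrative value.
block_num = 3

forward = chr(ord('A') + block_num)          # 'D', plain ASCII
reverse = chr(ord('A') + block_num + 1000)   # code point 1068, outside ASCII


def ascii_encodable(s):
    """Return True if s can be encoded with the 'ascii' codec (hypothetical helper)."""
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False


print(ascii_encodable(forward))   # forward-orientation label
print(ascii_encodable(reverse))   # reverse-orientation label
```

If that assumption holds, setting the environment variable PYTHONIOENCODING=utf-8 before running may get past the print call, although it would not make the resulting output ASCII-only.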
I am also hitting this problem. Entries in the VCF output contain non-ASCII characters. Here is an example VCF entry:
6 26555722 6_25759412-26580983_2 T <DEL> . PASS SV_TYPE=DEL;END=26555744;CI_POS=-7,7;CI_END=-7,7;SR=1;PE=2;SV_SPAN=22;EVENT_SPAN=41357;EVENT_START=26514387;EVENT_END=26555743;EVENT_AFFECTED_LEN=69;EVENT_NUM_SV=2;REF_STRUCTURE=ABCDE;ALT_STRUCTURE=ABBBCE;SEGMENT_ENDPTS=25759412,26514387,26514434,26555722,26555744,26580984;SEGMENT_ENDPTS_CIWIDTH=0,9,12,14,14,0;AF=0.500;SCORE_VS_REF=235.57;SCORE_VS_NEXT=0.33;NEXT_BEST_STRUCTURE=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{||}~^?\200\201\202\203\204\205\206\207\210\211\213\214\215\216\217\220\221/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{|}~^?\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221;NUM_PATHS=448 GT 1/0
The string "^?\200\201\202\203\204\205\206\207\210\211\213\214\215\216\217\220\221" appears as a space in editors that can handle the mixed encodings. However, this is a serious problem because standard VCF parsers all choke on these entries. For the moment I have to discard these calls in order to use the rest of the data, which is not ideal.
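A minimal sketch of that kind of stopgap filter, assuming the VCF is read as raw bytes and that a "bad" entry is any line containing a byte other than a tab or printable ASCII (this also catches the ^? control character, byte 0x7F). The function names and the toy records are made up for illustration and are not part of arcsv or any VCF library.

```python
def is_clean_vcf_line(raw: bytes) -> bool:
    """True if every byte is a tab or printable ASCII (0x20-0x7E)."""
    return all(b == 0x09 or 0x20 <= b <= 0x7E for b in raw.rstrip(b"\r\n"))


def partition_vcf_lines(lines):
    """Split an iterable of raw byte lines into (clean, rejected) lists."""
    clean, rejected = [], []
    for raw in lines:
        (clean if is_clean_vcf_line(raw) else rejected).append(raw)
    return clean, rejected


# Toy records (not real calls): the last one carries bytes 0x7F, 0x80, 0x81.
example = [
    b"##fileformat=VCFv4.2\n",
    b"6\t100\tid1\tT\t<DEL>\t.\tPASS\tALT_STRUCTURE=ABBBCE\tGT\t1/0\n",
    b"6\t200\tid2\tT\t<DEL>\t.\tPASS\tNEXT_BEST_STRUCTURE=xyz{|}~\x7f\x80\x81\tGT\t1/0\n",
]
clean, rejected = partition_vcf_lines(example)
print(len(clean), len(rejected))
```

Working on bytes rather than decoded text avoids having to guess an encoding for the affected lines. Writing the rejected lines to a separate file keeps them available for later inspection.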
Thanks for raising the issues here. Looks like I'll need to implement a better solution for naming genomic segments that scales further.
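One possible direction, sketched here only as an illustration and not as a description of what the eventual fix does: spreadsheet-style labels (A, ..., Z, AA, AB, ...) stay within ASCII for any number of segments. The function names, the "-" separator, and the trailing apostrophe for reverse orientation are all hypothetical choices.

```python
import string


def segment_label(index: int) -> str:
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ... (bijective base 26)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = string.ascii_uppercase
    label = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = letters[rem] + label
    return label


def structure_string(path_blocks) -> str:
    """Join (block_num, is_forward) pairs; reverse blocks get a trailing apostrophe."""
    return "-".join(
        segment_label(num) + ("" if is_forward else "'")
        for num, is_forward in path_blocks
    )


print(structure_string([(0, True), (1, True), (1, False), (30, True)]))
```

Because labels can be longer than one character, a separator is needed to keep the structure string unambiguous. Code that relies on one character per block, such as the string alignment in sv_affected_len, would need to work on lists of labels instead.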
The issue is fixed here: https://github.com/SUwonglab/arcsv/pull/14