UnicodeEncodeError: 'ascii' codec can't encode character

Open jiadong324 opened this issue 6 years ago • 2 comments

Hi,

I am running arcsv for complex SVs. The error is raised at line 22 of the function below, which I have marked with asterisks.

def sv_affected_len(path, blocks):
    # ref_path = list(range(0, 2 * len(blocks)))
    n_ref = len([x for x in blocks if not x.is_insertion()])
    ref_block_num = list(range(n_ref))
    ref_string = ''.join(chr(x) for x in range(ord('A'), ord('A') + n_ref))

    print('ref_string: {0}'.format(ref_string))

    path_block_num = []
    path_string = ''
    for i in path[1::2]:
        block_num = int(np.floor(i / 2))
        path_block_num.append(block_num)
        if i % 2 == 1:          # forward orientation
            path_string += chr(ord('A') + block_num)
        else:                   # reverse orientation
            path_string += chr(ord('A') + block_num + 1000)

   **print('path_string: {0}'.format(path_string))**

    affected_idx_1, affected_idx_2 = align_strings(ref_string, path_string)
    affected_block_1 = set(ref_block_num[x] for x in affected_idx_1)
    affected_block_2 = set(path_block_num[x] for x in affected_idx_2)
    affected_blocks = affected_block_1.union(affected_block_2)
    
    affected_len = sum(len(blocks[i]) for i in affected_blocks)
    return affected_len

Please help, thanks!

jiadong324 avatar Nov 29 '19 20:11 jiadong324
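A minimal sketch of why the starred `print` can fail, assuming the process writes to a stream that uses the ASCII codec (for example under a `C` locale). The helper names `block_label` and `is_ascii_encodable` are hypothetical and not part of arcsv; the labeling rule simply mirrors the loop in the function quoted above, where reverse-orientation blocks are shifted by 1000 code points.

```python
def block_label(block_num, forward):
    """Label a block the way the quoted loop does (hypothetical helper)."""
    # Reverse-orientation blocks are offset by 1000 code points,
    # which moves them well outside the ASCII range (0-127).
    offset = 0 if forward else 1000
    return chr(ord('A') + block_num + offset)


def is_ascii_encodable(text):
    """Return True if `text` can be written through an ASCII-only stream."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False


forward_label = block_label(2, forward=True)
reverse_label = block_label(2, forward=False)

# ascii() escapes non-ASCII characters, so these prints are safe everywhere.
print(ascii(forward_label), is_ascii_encodable(forward_label))
print(ascii(reverse_label), is_ascii_encodable(reverse_label))
```

If this is the cause, setting the environment variable `PYTHONIOENCODING=utf-8` may get past the debug `print`, but it would not change which characters end up in the output files.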

I am also hitting this problem. Entries in the VCF you output have non-ASCII characters. Here is an example VCF entry:

6 26555722 6_25759412-26580983_2 T <DEL> . PASS SV_TYPE=DEL;END=26555744;CI_POS=-7,7;CI_END=-7,7;SR=1;PE=2;SV_SPAN=22;EVENT_SPAN=41357;EVENT_START=26514387;EVENT_END=26555743;EVENT_AFFECTED_LEN=69;EVENT_NUM_SV=2;REF_STRUCTURE=ABCDE;ALT_STRUCTURE=ABBBCE;SEGMENT_ENDPTS=25759412,26514387,26514434,26555722,26555744,26580984;SEGMENT_ENDPTS_CIWIDTH=0,9,12,14,14,0;AF=0.500;SCORE_VS_REF=235.57;SCORE_VS_NEXT=0.33;NEXT_BEST_STRUCTURE=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{||}~^?\200\201\202\203\204\205\206\207\210\211\213\214\215\216\217\220\221/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz{|}~^?\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221;NUM_PATHS=448 GT 1/0

The string "^?\200\201\202\203\204\205\206\207\210\211\213\214\215\216\217\220\221" appears as a space in editors that can handle the mixed encodings. However, this is a serious problem because standard VCF parsers all choke on these entries. For the moment I have to throw away these calls in order to use the rest of the data, which is not an ideal situation.

johnemajor avatar Dec 20 '19 07:12 johnemajor
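A rough sketch of the workaround described above (discarding the affected calls), assuming the VCF is read as raw bytes so that no decoding step can fail. The function names `is_clean` and `filter_vcf_lines` and the sample lines are hypothetical, made up for illustration; they are not part of arcsv or of any VCF library.

```python
def is_clean(line: bytes) -> bool:
    """True if the line holds only tabs and printable ASCII (0x20-0x7E)."""
    body = line.rstrip(b"\r\n")
    # DEL (0x7F, shown as ^? in some editors) and bytes >= 0x80 are rejected.
    return all(b == 0x09 or 0x20 <= b <= 0x7E for b in body)


def filter_vcf_lines(lines):
    """Split lines into (kept, dropped) based on is_clean()."""
    kept, dropped = [], []
    for line in lines:
        (kept if is_clean(line) else dropped).append(line)
    return kept, dropped


# Tiny made-up sample: one header, one clean record, one corrupted record.
sample = [
    b"##fileformat=VCFv4.2\n",
    b"6\t100\tid1\tT\t<DEL>\t.\tPASS\tNEXT_BEST_STRUCTURE=ABCDE/ABDE\n",
    b"6\t200\tid2\tT\t<DEL>\t.\tPASS\tNEXT_BEST_STRUCTURE=xyz{|}~\x7f\x80\x81\n",
]
kept, dropped = filter_vcf_lines(sample)
print(len(kept), "kept,", len(dropped), "dropped")
```

In practice one would iterate over a file opened with `open(path, "rb")` and write the kept lines to a new file; the dropped list makes it easy to see how many calls are lost.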

Thanks for raising the issues here. Looks like I'll need to implement a better solution for naming genomic segments that scales further.

jgarthur avatar Mar 31 '24 21:03 jgarthur
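One possible direction for segment names that scale, sketched under my own assumptions: spreadsheet-style labels (A, B, ..., Z, AA, AB, ...) joined with a separator, with a trailing apostrophe for reverse orientation. The names `segment_label` and `format_structure`, the comma separator, and the apostrophe convention are all hypothetical; this is not necessarily the scheme arcsv adopted.

```python
def segment_label(index: int) -> str:
    """Map 0, 1, 2, ... to A, B, ..., Z, AA, AB, ... (bijective base 26)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    label = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def format_structure(path, sep=","):
    """Render a list of (segment_index, is_forward) pairs as ASCII text.

    A separator is needed because multi-letter labels would otherwise be
    ambiguous ("AA" could be one segment or segment A twice).
    """
    return sep.join(
        segment_label(i) + ("" if forward else "'") for i, forward in path
    )


print(format_structure([(0, True), (1, True), (26, False), (2, True)]))
```

Every label stays within `A`-`Z`, so the output remains plain ASCII no matter how many segments a region contains.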

The issue is fixed here: https://github.com/SUwonglab/arcsv/pull/14

jgarthur avatar Jun 24 '24 00:06 jgarthur