handicap=int(sgf_prop(props.get('HA', [0]))) ValueError: invalid literal for int() with base 10: '吴受先'
Open · greatken999 opened this issue 8 years ago · 5 comments
When I run "python3 main.py preprocess data/other/tmp/wuqingyuan/" I get this error:
366 sgfs found.
Estimated number of chunks: 17
Traceback (most recent call last):
File "main.py", line 94, in <module>
argh.dispatch(parser)
File "/usr/local/lib/python3.5/dist-packages/argh/dispatching.py", line 174, in dispatch
for line in lines:
File "/usr/local/lib/python3.5/dist-packages/argh/dispatching.py", line 277, in _execute_command
for line in result:
File "/usr/local/lib/python3.5/dist-packages/argh/dispatching.py", line 260, in _call
result = function(*positional, **keywords)
File "main.py", line 49, in preprocess
test_chunk, training_chunks = parse_data_sets(*data_sets)
File "/mnt/ken-volume/MuGo/load_data_sets.py", line 140, in parse_data_sets
test_chunk, training_chunks = split_test_training(positions_w_context, est_num_positions)
File "/mnt/ken-volume/MuGo/load_data_sets.py", line 60, in split_test_training
positions_w_context = list(positions_w_context)
File "/mnt/ken-volume/MuGo/load_data_sets.py", line 52, in get_positions_from_sgf
for position_w_context in replay_sgf(f.read()):
File "/mnt/ken-volume/MuGo/sgf_wrapper.py", line 124, in replay_sgf
handicap=int(sgf_prop(props.get('HA', [0]))),
ValueError: invalid literal for int() with base 10: '吴受先'
It looks like in some SGF files props.get('HA', [0]) returns a string, not an int.
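A tolerant parse of the HA value might look like the sketch below. The helper name `parse_handicap` and the choice to fall back to 0 are assumptions for illustration; they are not part of MuGo.

```python
def parse_handicap(raw, default=0):
    """Return the handicap as an int, or `default` when `raw` is not numeric.

    Hypothetical helper: some older SGF files store free text in HA
    (for example '吴受先') instead of a number.
    """
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Non-numeric or missing HA value: treat it as "no handicap".
        return default
```

Under that assumption, the failing line in sgf_wrapper.py could call `parse_handicap(sgf_prop(props.get('HA', [0])))` instead of `int(...)`. Note that this silently discards whatever the text in HA was meant to say.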
#with open(file) as f:
with open(file, 'rt', encoding='gb18030', errors='ignore') as f:
to fix the bug 👍:
366 sgfs found.
Estimated number of chunks: 17
Traceback (most recent call last):
File "main.py", line 94, in <module>
argh.dispatch(parser)
File "/usr/lib/python3.5/site-packages/argh/dispatching.py", line 174, in dispatch
for line in lines:
File "/usr/lib/python3.5/site-packages/argh/dispatching.py", line 277, in _execute_command
for line in result:
File "/usr/lib/python3.5/site-packages/argh/dispatching.py", line 260, in _call
result = function(*positional, **keywords)
File "main.py", line 49, in preprocess
test_chunk, training_chunks = parse_data_sets(*data_sets)
File "/home/ken/ai/go/MuGo/load_data_sets.py", line 140, in parse_data_sets
test_chunk, training_chunks = split_test_training(positions_w_context, est_num_positions)
File "/home/ken/ai/go/MuGo/load_data_sets.py", line 60, in split_test_training
positions_w_context = list(positions_w_context)
File "/home/ken/ai/go/MuGo/load_data_sets.py", line 52, in get_positions_from_sgf
for position_w_context in replay_sgf(f.read()):
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 5: invalid continuation byte
Oh.. ugh, this makes me sad.
So, the SGF file should declare that its encoding is GB18030; I can't just assume it. Most western-generated SGFs assume UTF-8, so putting in this new assumption would just break the other half of SGFs.
The other issue is that the HA property should be a number (http://www.red-bean.com/sgf/go.html#types), not "Wu played first", even though that was the convention back then. I can't really ask you to go fix whatever SGF editor created these files, though, so I think the best I could do is just have a try-except that tries different encodings.
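The try-except idea could be sketched roughly as follows. The function name `decode_sgf_bytes` and the candidate list are assumptions for illustration, not MuGo code. UTF-8 is tried first because it is strict, while GB18030 accepts almost any byte sequence and would otherwise mask UTF-8 files.

```python
def decode_sgf_bytes(raw, encodings=('utf-8', 'gb18030')):
    """Decode raw SGF bytes by trying each candidate encoding in order.

    Hypothetical helper: the encoding list and its order are assumptions.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    # Nothing decoded cleanly: keep going, replacing undecodable bytes.
    return raw.decode(encodings[0], errors='replace')
```

The caller would open the file in binary mode ('rb') and pass the result of f.read() to this function before handing the text to replay_sgf.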
Encoding bug fixed; tested OK with both UTF-8 and GB18030 SGF files.
You need to run "pip3 install cchardet" first to install the cchardet module.
Change load_data_sets.py line 48 to:
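import cchardet as chardet

def get_positions_from_sgf(file):
    # Read the raw bytes first so cchardet can guess the encoding.
    with open(file, 'rb') as f:
        result = chardet.detect(f.read())['encoding']
    # Reopen as text using the detected encoding, skipping undecodable bytes.
    with open(file, 'rt', encoding=result, errors='ignore') as f: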