Encoding issue in Windows with TestSuite.to_raw_file
The line f = open(path, 'w') needs encoding='utf-8' to work properly on Windows. Without it, the following error is raised when the output contains characters that the default Windows codec cannot encode:
UnicodeEncodeError: 'charmap' codec can't encode character <unhandled char> in position 18458: character maps to <undefined>
Modified below:
def to_raw_file(self, path, file_format=None, format_fn=None, header=None, n=None, seed=None, new_sample=True):
    """Flatten all tests into individual examples and print them to file.
    Indices of example to test case will be stored in each test.
    If n is not None, test.run_idxs will store the test case indexes.
    The line ranges for each test will be saved in self.test_ranges.

    Parameters
    ----------
    path : string
        File path
    file_format : string, must be one of 'jsonl', 'squad', 'qqp_test', or None
        None just calls str(x) for each example in self.data
        squad assumes x has x['question'] and x['passage'], or that format_fn does this
    format_fn : function or None
        If not None, call this function to format each example in self.data
    header : string
        If not None, first line of file
    n : int
        If not None, number of samples to draw
    seed : int
        Seed to use if n is not None
    new_sample: bool
        If False, will rely on a previous sample and ignore the 'n' and 'seed' parameters

    """
    ret = ''
    all_examples = []
    add_id = False
    if file_format == 'qqp_test':
        add_id = True
        file_format = 'tsv'
        header = 'id\tquestion1\tquestion2'
    if header is not None:
        ret += header.strip('\n') + '\n'
    all_examples = self.get_raw_examples(file_format=file_format, format_fn=format_fn, n=n, seed=seed, new_sample=new_sample)
    if add_id and file_format == 'tsv':
        all_examples = ['%d\t%s' % (i, x) for i, x in enumerate(all_examples)]
    if file_format == 'squad':
        ret_map = {'version': 'fake',
                   'data': []}
        for i, x in enumerate(all_examples):
            r = {'title': '',
                 'paragraphs': [{
                     'context': x['passage'],
                     'qas': [{'question': x['question'],
                              'id': str(i)
                              }]
                 }]
                 }
            ret_map['data'].append(r)
        ret = json.dumps(ret_map)
    else:
        ret += '\n'.join(all_examples)
    f = open(path, 'w', encoding='utf-8')
    f.write(ret)
    f.close()
@LoganKells Could you provide a short code snippet that produces this error? I have a branch with a simple fix: https://github.com/emc5ud/checklist/commit/6eb239546243a6796a20fc4950b7c8c7a2527bdf
But I'd like to test it.
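One possible way to reproduce the error on any platform is to force the encoding explicitly, as in the sketch below. It does not call checklist at all. It assumes the default encoding on the affected Windows machine is cp1252 (which Python reports as the 'charmap' codec), and the helper name try_write and the sample text are hypothetical, chosen only for illustration.

```python
import os
import tempfile

# Sample text with characters that cp1252 cannot represent (illustrative only).
text = "caf\u00e9 \u4f60\u597d"


def try_write(path, data, encoding):
    """Write data to path; return the UnicodeEncodeError raised, or None on success."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(data)
    except UnicodeEncodeError as err:
        return err
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.txt")
    # Assumption: this mimics open(path, 'w') on a Windows machine defaulting to cp1252.
    cp1252_error = try_write(path, text, "cp1252")
    # Mirrors the proposed fix: open(path, 'w', encoding='utf-8').
    utf8_error = try_write(path, text, "utf-8")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print("cp1252:", cp1252_error)
print("utf-8:", utf8_error)
```

If the assumption holds, the first write fails with the same kind of UnicodeEncodeError as in the report, while the UTF-8 write succeeds and the text reads back unchanged.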