Encoding issue in Windows with TestSuite.to_raw_file

Open LoganKells opened this issue 4 years ago • 1 comment

The line f = open(path, 'w') needs encoding='utf-8' to work properly on Windows. Without it, the write fails with the following error when the text contains characters that are not representable in the platform's default encoding:

UnicodeEncodeError: 'charmap' codec can't encode character <unhandled char> in position 18458: character maps to <undefined>
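
The failure can be reproduced on any platform by passing the encoding explicitly. The sketch below is not taken from the checklist code base: the sample text, the write_text helper, and the demo function are hypothetical, and it assumes the Windows default encoding in question is cp1252, which is common on Western-locale installations but not guaranteed.

```python
import os
import tempfile

# Hypothetical sample text: U+2192 (RIGHTWARDS ARROW) has no mapping in cp1252.
TEXT = "question \u2192 answer"


def write_text(path, text, encoding):
    """Write text to path using an explicit encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def demo():
    """Return (cp1252_write_failed, text_read_back_after_utf8_write)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "raw.txt")

        # Assumption: cp1252 stands in for the Windows default encoding.
        try:
            write_text(path, TEXT, "cp1252")
            failed = False
        except UnicodeEncodeError:
            failed = True

        # The same text is written without error once UTF-8 is requested.
        write_text(path, TEXT, "utf-8")
        with open(path, encoding="utf-8") as f:
            round_trip = f.read()

    return failed, round_trip


if __name__ == "__main__":
    print(demo())
```

If the cp1252 write raises and the UTF-8 round trip returns the original text, that is consistent with the explanation above: the error comes from the encoding chosen at open() time, not from the test data itself.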

Modified below:

    def to_raw_file(self, path, file_format=None, format_fn=None, header=None, n=None, seed=None, new_sample=True):
        """Flatten all tests into individual examples and print them to file.
        Indices of example to test case will be stored in each test.
        If n is not None, test.run_idxs will store the test case indexes.
        The line ranges for each test will be saved in self.test_ranges.

        Parameters
        ----------
        path : string
            File path
        file_format : string, must be one of 'jsonl', 'squad', 'qqp_test', or None
            None just calls str(x) for each example in self.data
            squad assumes x has x['question'] and x['passage'], or that format_fn does this
        format_fn : function or None
            If not None, call this function to format each example in self.data
        header : string
            If not None, first line of file
        n : int
            If not None, number of samples to draw
        seed : int
            Seed to use if n is not None
        new_sample: bool
            If False, will rely on a previous sample and ignore the 'n' and 'seed' parameters

        """
        ret = ''
        all_examples = []
        add_id = False
        if file_format == 'qqp_test':
            add_id = True
            file_format = 'tsv'
            header = 'id\tquestion1\tquestion2'
        if header is not None:
            ret += header.strip('\n') + '\n'
        all_examples = self.get_raw_examples(file_format=file_format, format_fn=format_fn, n=n, seed=seed, new_sample=new_sample)

        if add_id and file_format == 'tsv':
            all_examples = ['%d\t%s' % (i, x) for i, x in enumerate(all_examples)]
        if file_format == 'squad':
            ret_map = {'version': 'fake',
                       'data': []}
            for i, x in enumerate(all_examples):
                r = {'title': '',
                     'paragraphs': [{
                        'context': x['passage'],
                        'qas': [{'question' : x['question'],
                                 'id': str(i)
                                 }]
                      }]
                    }
                ret_map['data'].append(r)
            ret = json.dumps(ret_map)
        else:
            ret += '\n'.join(all_examples)
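        # Explicit encoding so the output does not depend on the platform default;
        # the with block closes the file even if the write raises.
        with open(path, 'w', encoding='utf-8') as f:
            f.write(ret)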

LoganKells avatar Nov 21 '21 23:11 LoganKells

@LoganKells Could you provide a short code snippet that produces this error? I have a branch with a simple fix: https://github.com/emc5ud/checklist/commit/6eb239546243a6796a20fc4950b7c8c7a2527bdf

But I'd like to test it.

emc5ud avatar Dec 17 '21 16:12 emc5ud