training_results_v0.6 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3


Open romanzac opened this issue 5 years ago • 1 comment

Hi All,

Is this a problem with the dataset or with the code? Thanks for any hints.

Run: training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/download_dataset.sh

Error:

Input sentences: 4562102 Output sentences: 4524868
Cleaning data/train.tok...
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = "C.UTF-8",
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
clean-corpus.perl: processing data/train.tok.de & .en to data/train.tok.clean, cutoff 1-80, ratio 9
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000)..........(1200000)..........(1300000)..........(1400000)..........(1500000)..........(1600000)..........(1700000)..........(1800000)..........(1900000)..........(2000000)..........(2100000)..........(2200000)..........(2300000)..........(2400000)..........(2500000)..........(2600000)..........(2700000)..........(2800000)..........(2900000)..........(3000000)..........(3100000)..........(3200000)..........(3300000)..........(3400000)..........(3500000)..........(3600000)..........(3700000)..........(3800000)..........(3900000)..........(4000000)..........(4100000)..........(4200000)..........(4300000)..........(4400000)..........(4500000)......
Input sentences: 4562102 Output sentences: 4500966
Traceback (most recent call last):
  File "pytorch/scripts/filter_dataset.py", line 79, in <module>
    main()
  File "pytorch/scripts/filter_dataset.py", line 55, in main
    for idx, lines in enumerate(zip(f1, f2)):
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6727: ordinal not in range(128)
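
For context, byte 0xc3 is the lead byte of many two-byte UTF-8 sequences (for example, "ü" encodes as 0xc3 0xbc), so this is the error Python raises when UTF-8 text is decoded with the ASCII codec. Below is a minimal sketch that reproduces the same exception on a throwaway file. The file name and the sample sentence are made up for illustration and do not come from the dataset.

```python
import os
import tempfile

# Hypothetical sample line; "\u00fc" is "ü", stored in UTF-8 as bytes 0xc3 0xbc.
sample = "Die T\u00fcr ist offen.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tok.de")  # illustrative name only
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding with the ASCII codec fails on the first non-ASCII byte.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Decoding with an explicit UTF-8 codec does not depend on the locale.
    with open(path, encoding="utf-8") as f:
        utf8_text = f.read()

print(ascii_error)          # the same kind of message as in the traceback
print(utf8_text == sample)  # round trip through UTF-8 preserves the text
```

If this reproduces the message, the dataset itself is probably fine and the problem is most likely the codec Python selected for reading it.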

Nov 07 '19 08:11 romanzac

Are these variables a part of your environment?

export LANG=C.UTF-8 
export LC_ALL=C.UTF-8
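
On Python 3.6 (the version in the traceback), `open()` without an `encoding` argument uses the locale's preferred encoding, which is why these variables matter. Note that the perl warning in the log reports `LC_ALL` and `LANG` as already set to `C.UTF-8` but not supported on the system, followed by a fallback to the "C" locale, so the locale may also need to be installed. The sketch below checks what Python will pick and sidesteps the locale by passing the encoding explicitly. The helper name `iter_parallel_lines` is hypothetical and is not part of `filter_dataset.py`; only the `enumerate(zip(f1, f2))` loop shape is taken from the traceback.

```python
import locale
import os
import tempfile

# Encoding that open() uses when no encoding= is given.
# Under the plain "C" locale on Python 3.6 this is typically an ASCII
# codec (reported on Linux as "ANSI_X3.4-1968").
print(locale.getpreferredencoding(False))


def iter_parallel_lines(src_path, tgt_path, encoding="utf-8"):
    """Yield (index, (src_line, tgt_line)) pairs from two parallel text files.

    Hypothetical helper: the encoding is passed explicitly, so the result
    does not depend on LANG / LC_ALL.
    """
    with open(src_path, encoding=encoding) as f1, \
            open(tgt_path, encoding=encoding) as f2:
        for idx, lines in enumerate(zip(f1, f2)):
            yield idx, lines


# Usage on two throwaway files (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    tgt = os.path.join(tmp, "tgt.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("\u00dcber uns\n")  # "Über uns"
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("About us\n")
    pairs = list(iter_parallel_lines(src, tgt))

print(len(pairs))
```

Passing `encoding="utf-8"` in the script is an alternative to fixing the environment. Whether that change is appropriate for the benchmark code is an assumption here, not something confirmed in this thread.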

Feb 25 '20 16:02 nileshnegi
