chatbot-rnn Issue with train.py

Any thoughts? I am using Windows.

Preprocessing file 2/6 (reddit-parse/output\output 1.bz2)...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 49, in main
    train(args)
  File "train.py", line 55, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "D:\bot\utils.py", line 39, in __init__
    self._preprocess(self.input_files[i], self.tensor_file_template.format(i))
  File "D:\bot\utils.py", line 107, in _preprocess
    data = file_reference.read()
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 23267: character maps to <undefined>

May 17 '18 07:05 ghost
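The traceback above ends in cp1252.py, which suggests the file was decoded with the Windows default code page rather than an explicit encoding. A minimal sketch of why byte 0x9d fails there follows. It assumes, without having seen the actual data, that the training file is UTF-8, where 0x9d can appear as the last byte of a right double quotation mark (U+201D).

```python
# Byte 0x9d from the traceback has no character assigned in cp1252,
# so decoding it with that codec raises UnicodeDecodeError.
try:
    b"\x9d".decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# In UTF-8, 0x9d can occur as a continuation byte, for example as the
# final byte of U+201D (right double quotation mark).
utf8_bytes = "\u201d".encode("utf-8")

print(cp1252_ok)
print(utf8_bytes)
```

If the data really is UTF-8, passing encoding="utf-8" when opening the file avoids the default code page entirely.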

Hi, I'm having the same problem when running train.py on new data.

May 27 '18 18:05 sashasmirnova

This might not be the right solution, but here is a patch for that: https://github.com/neofob/chatbot-rnn/commit/1f56cb941b834c5bc95c8f40fe58ce08277f4d10

May 31 '18 14:05 neofob

Yeah, @neofob's patch changes the encoding that the utils use to read the training sets, but this should match the encoding you used to write the training data as well (i.e., if your training files are encoded as utf-8, they should be read as utf-8).

Although this allows for training, I'm not sure the char-rnn works with utf-8 encodings at all, since I am just getting gibberish back from the model when it is trained this way. (https://github.com/karpathy/char-rnn/pull/113)

Jul 30 '18 02:07 zhou-daniel-dz
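To illustrate the point above about matching the read encoding to the write encoding, here is a small sketch. The sample string is made up for the example. When the bytes happen to be defined in cp1252, a mismatch does not raise an error. It silently produces different characters instead.

```python
# Hypothetical sample text starting with a left double quotation mark (U+201C).
original = "\u201chi"

# Written as UTF-8 ...
data = original.encode("utf-8")

# ... read back with the matching encoding: the text survives.
right = data.decode("utf-8")

# ... read back as cp1252: no exception, but each UTF-8 byte becomes its own character.
wrong = data.decode("cp1252")

print(repr(right))
print(repr(wrong))
```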

Any news? Same problem here.

The @neofob patch doesn't work for me. I guess it's because the errors="ignore" or errors="replace" parameter of bz2.open is not working.

I am using the same @pender reddit dataset (https://github.com/pender/chatbot-rnn)

Aug 30 '18 11:08 geroale

You just need to make sure the data you're training on is encoded in ANSI.

If your parser must read and write in a different encoding, just save the output text file as ANSI and it should be useable. Clearly certain characters cannot be mapped, but the percentage of those characters seems too small to make a difference.

Aug 30 '18 16:08 zhou-daniel-dz
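The "save as ANSI" suggestion above could be scripted roughly as follows. This is a sketch under two assumptions: that "ANSI" here means the Windows-1252 code page (cp1252, the codec named in the original traceback), and that the parser output is UTF-8. The function name convert_to_cp1252 and the file names are hypothetical. Characters with no cp1252 equivalent are written as "?".

```python
import os
import tempfile


def convert_to_cp1252(src_path, dst_path, src_encoding="utf-8"):
    """Re-encode a text file as cp1252, replacing unmappable characters with '?'."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="cp1252", errors="replace", newline="") as dst:
        for line in src:
            dst.write(line)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input_utf8.txt")
    dst_path = os.path.join(tmp, "input_ansi.txt")
    with open(src_path, "w", encoding="utf-8", newline="") as f:
        f.write("caf\u00e9 \u4f60\n")  # U+00E9 exists in cp1252, U+4F60 does not
    convert_to_cp1252(src_path, dst_path)
    with open(dst_path, "rb") as f:
        converted = f.read()

print(converted)
```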

@neofob @zhou-daniel-dz I tried to figure out how to make char-rnn work with utf-8, but this simple patch in utils.py:

    if input_file.endswith(".bz2"):
        file_reference = bz2.open(input_file, mode='rt', encoding="utf-8", errors="replace")
    elif input_file.endswith(".txt"):
        file_reference = io.open(input_file, mode='rt', encoding="utf-8", errors="replace")

doesn't work for me. Probably it's not enough?

Nov 12 '18 04:11 remotejob
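One way to check whether bz2.open honors the errors argument is to test it in isolation on a file with a known bad byte. The sketch below uses a made-up sample and a temporary file, not the reddit dataset.

```python
import bz2
import os
import tempfile

# Hypothetical sample: ASCII text with one stray 0x9d byte,
# which is not valid UTF-8 on its own.
raw = b"hello \x9d world"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bz2")
    with bz2.open(path, "wb") as f:
        f.write(raw)

    # errors="replace" substitutes U+FFFD for each undecodable byte.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        replaced = f.read()

    # errors="ignore" drops undecodable bytes instead.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()

print(repr(replaced))
print(repr(ignored))
```

If both reads succeed on your setup, the errors argument itself is working, and the failure may come from elsewhere, for example a code path that still opens the file without these arguments.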

No, it's just a bad or wrong format

Apr 27 '22 11:04 breadbrowser

of the bz2 or txt file, or of a file renamed from zst.

Apr 27 '22 11:04 breadbrowser

Issue with train.py - charset errors.