rnn-tutorial-rnnlm

not able to get data from csv file to train network in "train-theano.py"

Open totial opened this issue 8 years ago • 3 comments

Hey, I'm having trouble getting the data to train the RNN, specifically on this line: sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader]). If I open the file as 'rb', I get the error:

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

And if I open it with 'r', I get:

sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])

AttributeError: 'str' object has no attribute 'decode'

I'm not sure which is the basic idea: should the NN be trained with strings or with binary codes (I guess binary codes)? Thanks for your time!

totial avatar Feb 03 '17 15:02 totial

Maybe your Python version is 3.x. The code below runs without error under Python 2.7:

with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]

GoingMyWay avatar May 05 '17 01:05 GoingMyWay

You can remove ".decode('utf-8')" and try again.

chrischang80 avatar Feb 12 '18 10:02 chrischang80

You can remove ".decode('utf-8')" and try again.

Yes, you must remove this, but a couple of other changes are also required. The file has to be opened in text mode with an explicit encoding, so that the entire line becomes: with open('data/reddit-comments-2015-08.csv', 'rt', encoding="utf8") as f:

Pavonlo avatar Feb 07 '19 19:02 Pavonlo
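
Putting the suggestions in this thread together, a Python 3 version of the loading step might look like the sketch below. It is not the tutorial's own code: the function name load_sentences and the regex splitter simple_sent_tokenize are hypothetical helpers added so the example runs without NLTK data, and the token values are assumed. In Python 3, csv.reader expects a file opened in text mode and yields str rows, which is why both errors in the original question appear and why .decode('utf-8') has to go. reader.next() also becomes next(reader).

```python
import csv
import itertools
import re

# Names follow the snippet quoted earlier in the thread; the values are assumed.
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"


def simple_sent_tokenize(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def load_sentences(path, sent_tokenize=simple_sent_tokenize):
    # Text mode plus an explicit encoding: rows arrive as str, so there is
    # nothing to decode. newline="" is what the csv module docs recommend.
    with open(path, "rt", encoding="utf8", newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header row (Python 3 form of reader.next())
        # Split full comments into sentences, skipping empty rows
        sentences = itertools.chain.from_iterable(
            sent_tokenize(row[0].lower()) for row in reader if row
        )
        # Append SENTENCE_START and SENTENCE_END; the list is built while
        # the file is still open, so the lazy chain is fully consumed here.
        return [
            "%s %s %s" % (sentence_start_token, s, sentence_end_token)
            for s in sentences
        ]
```

To keep the original behaviour, pass the NLTK tokenizer explicitly, for example load_sentences('data/reddit-comments-2015-08.csv', nltk.sent_tokenize), assuming nltk and its sentence tokenizer data are installed.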