rnn-tutorial-rnnlm
Not able to get data from CSV file to train network in "train-theano.py"
Hey, I'm having trouble getting the data to train the RNN. Specifically on this line:
sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
If I open the file with 'rb', I get the error:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
And if I open it with 'r', I get:
sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
AttributeError: 'str' object has no attribute 'decode'
I'm not sure what the basic idea is here: should the NN be trained on strings or on binary data (I'd guess binary data)? Thanks for your time!
You are probably running Python 3.x. The code below runs without error under Python 2.7:
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
You can remove ".decode('utf-8')" and try again.
Yes, you must remove this, but a couple of other changes are also required under Python 3. The line that opens the file becomes:
with open('data/reddit-comments-2015-08.csv', 'rt', encoding="utf8") as f:
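For reference, here is a minimal Python 3 sketch of the whole loading step with those changes applied: the file is opened in text mode, so each CSV field is already a str and needs no .decode('utf-8'), and reader.next() is replaced by the built-in next(reader), since Python 3 iterators have no .next() method. The function name load_sentences, the tokenize parameter, and the two token values are assumptions made for illustration and are not taken from the tutorial. In the tutorial's setting you would presumably pass nltk.sent_tokenize as tokenize.

```python
import csv
import itertools

# Assumed token values for this sketch; the tutorial defines its own
# sentence_start_token and sentence_end_token elsewhere.
SENTENCE_START = "SENTENCE_START"
SENTENCE_END = "SENTENCE_END"


def load_sentences(path, tokenize):
    """Return tagged, lower-cased sentences from the first CSV column.

    `tokenize` is any callable mapping a string to a list of sentences
    (hypothetical parameter; e.g. nltk.sent_tokenize).
    """
    # Text mode with an explicit encoding: csv.reader yields str, not bytes.
    with open(path, "rt", encoding="utf8", newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader, None)  # skip the header row (Python 3 form of reader.next())
        # Split full comments into sentences; empty rows are skipped.
        sentences = itertools.chain.from_iterable(
            tokenize(row[0].lower()) for row in reader if row
        )
        # Build the list while the file is still open, because the
        # generator above reads rows lazily.
        return ["%s %s %s" % (SENTENCE_START, s, SENTENCE_END) for s in sentences]
```

Usage would look like sentences = load_sentences('data/reddit-comments-2015-08.csv', nltk.sent_tokenize), assuming nltk and its sentence tokenizer data are installed.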