PyCon-Canada-2019-NLP-Tutorial
BBC News_LSTM
Hi Susan, just sharing that I needed to add this line for charts to appear:
import matplotlib.pyplot as plt
Also, I added these lines at the beginning to make the epochs run:
from tensorflow.compat.v1 import ConfigProto, InteractiveSession  # on TF 1.x: from tensorflow import ...

# let TensorFlow grow GPU memory allocation as needed instead of grabbing it all up front
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
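If you are on TensorFlow 2.x, a similar effect is available through the native config API; a minimal alternative sketch (not the notebook's code):
import tensorflow as tf

# enable memory growth per GPU; must run before any GPU ops
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)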
Another thing I just noticed: if I change the "wework example" to anything else, for example this text: "Microsoft released a new technology for computers. Google and Apple released new smartphones today", the prediction does not change.
I printed the "padded" version of the txt and it "starts with 0's", counterintuitive.. as I expected it ended with 0's as the berlin article when padded. Leading zeroes:
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269 409 1216 8 82 1636 959 995 1676 428 409 8 1 432]]
I changed the "padded" line in the prediction step to this:
padded = pad_sequences(seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)
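For context: pad_sequences defaults to padding='pre' and truncating='pre', so you only get trailing zeros when you pass padding='post' (presumably what padding_type was set to during training, given the trailing zeros on the Berlin article). A minimal sketch of the difference:
from tensorflow.keras.preprocessing.sequence import pad_sequences

seq = [[269, 409, 1216]]
print(pad_sequences(seq, maxlen=6))                  # [[0 0 0 269 409 1216]] -- default padding='pre'
print(pad_sequences(seq, maxlen=6, padding='post'))  # [[269 409 1216 0 0 0]]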
After modifying pad_sequences, I had to change the labels to this:
labels = ['none', 'sport', 'business', 'politics', 'tech', 'entertainment']
adding "none" because the prediction is a number from 0 to 5.
Right. labels should be ['none', 'sport', 'business', 'politics', 'tech', 'entertainment']
The last layer outputs scores for labels 0, 1, 2, 3, 4, 5, although 0 is never used.
I also don't understand how the labels end up ordered in this list at the end, where the prediction happens and argmax picks one label from the array by index. Could someone please explain? Thanks.
Hi, the labels are converted to numbers using the Tokenizer() class; the tokenizer chooses the numbers for the labels.
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# fit a separate tokenizer on the label strings and convert them to integer IDs
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
At the end the "argmax" function is used to select the top most value. Example, if the array is [0.0,0.2,0.9,0.4,0.3,0.7] the argmax will return 3 (the third item in the array is the topmost) which corresponds to "Politics".
OK, it's clearer now. So you manually check how the tokenizer converted the classes to numbers and then order them, like ['none', 'sport', 'business', 'politics', 'tech', 'entertainment']? Or is there a faster way to check which class is mapped to which number, so we know how to order them?
OK the answer for the correct label order is here:
# after this code:
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# you have to add this line:
label_index = label_tokenizer.word_index
This will create the word index for the labels, and then you can show them with their IDs using these commands:
res = list(sum(sorted(label_index.items(), key=lambda x: x[1]), ()))
print(res)
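A slightly more readable variant of the same idea (assuming the label_index built above):
# print each label next to the ID the tokenizer assigned it
for label, idx in sorted(label_index.items(), key=lambda x: x[1]):
    print(idx, label)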
Another thing you might wonder is why, if we are training N classes, we need one more class. The answer lies in the loss function (loss='sparse_categorical_crossentropy'), where you have multiple labels but only one correct answer: the outputs are mutually exclusive. Because the label Tokenizer assigns IDs starting at 1, index 0 is never used as a target, so the output layer needs an extra class "at the left" to cover that slot. A softmax activation is also needed here, as it works well for mutually exclusive classes: the outputs across all classes sum to 1.
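To make this concrete, here is a minimal sketch of the output end of such a model (layer sizes are hypothetical, not the notebook's exact values):
import tensorflow as tf

vocab_size, embedding_dim = 1000, 16  # hypothetical sizes for illustration
num_labels = 5  # sport, business, politics, tech, entertainment

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    # num_labels + 1 units: tokenizer label IDs start at 1, so index 0 exists but is never a target
    tf.keras.layers.Dense(num_labels + 1, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')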