PyCon-Canada-2019-NLP-Tutorial

BBC News_LSTM

Open jalbarracinv opened this issue 5 years ago • 9 comments

Hi Susan, just sharing that I needed to add this line for charts to appear:

import matplotlib.pyplot as plt

jalbarracinv avatar Jan 19 '20 17:01 jalbarracinv

Also, I added these lines at the beginning to make the epochs run:

```python
# Assuming TensorFlow 2.x; on TF 1.x these come from tensorflow directly
from tensorflow.compat.v1 import ConfigProto, InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
session = InteractiveSession(config=config)
```
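
TensorFlow 2.x also exposes this setting through the public tf.config API; a sketch (not code from the tutorial):

```python
import tensorflow as tf

# Enable on-demand GPU memory growth for every visible GPU
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```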

jalbarracinv avatar Jan 19 '20 17:01 jalbarracinv

Another thing I just noticed: if I replace the "wework example" text with anything else, for example this txt: "Microsoft released a new technology for computers. Google and Apple released new smartphones today", the prediction does not change.

I printed the "padded" version of the txt and it starts with 0's, which is counterintuitive: I expected it to end with 0's, like the berlin article when padded. Leading zeros:

[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269 409 1216 8 82 1636 959 995 1676 428 409 8 1 432]]

I fixed this by changing the prediction's "padded" line to:

padded = pad_sequences(seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)
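
For anyone else hitting this, a minimal sketch of the difference (assuming the tutorial trains with padding_type = 'post'; the token IDs below are made up):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seq = [[269, 409, 1216]]  # hypothetical token IDs

# pad_sequences defaults to padding='pre', which prepends zeros
print(pad_sequences(seq, maxlen=6))
# [[   0    0    0  269  409 1216]]

# padding='post' appends them instead, matching the training data
print(pad_sequences(seq, maxlen=6, padding='post'))
# [[ 269  409 1216    0    0    0]]
```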

jalbarracinv avatar Jan 19 '20 18:01 jalbarracinv

After modifying the pad_sequences, I had to change the labels to this:

labels = ['none','sport', 'bussiness', 'politics', 'tech', 'entertainment']

adding "none" because prediction is a number from 0 to 5

jalbarracinv avatar Jan 19 '20 18:01 jalbarracinv

Right. labels should be ['none','sport', 'bussiness', 'politics', 'tech', 'entertainment'].

The last layer outputs scores for labels 0, 1, 2, 3, 4, 5, although 0 is never used.

tzutalin avatar Nov 09 '20 00:11 tzutalin

> Right. labels should be ['none','sport', 'bussiness', 'politics', 'tech', 'entertainment'].
>
> The last layer outputs scores for labels 0, 1, 2, 3, 4, 5, although 0 is never used.

I also don't understand how the labels are ordered in this list at the end, where the prediction happens and one label is chosen from the array using argmax as the index. Could someone please explain? Thanks.

jburagev avatar Dec 27 '20 20:12 jburagev

> Right. labels should be ['none','sport', 'bussiness', 'politics', 'tech', 'entertainment']. The last layer outputs scores for labels 0, 1, 2, 3, 4, 5, although 0 is never used.
>
> I also don't understand how the labels are ordered in this list at the end, where the prediction happens and one label is chosen from the array using argmax as the index. Could someone please explain? Thanks.

Hi, the labels are converted to numbers using Keras's Tokenizer() class; the tokenizer chooses the number for each label:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a tokenizer on the label strings; indices start at 1, so 0 stays unused
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
```

At the end, the argmax function is used to select the index of the largest value. For example, if the array is [0.0, 0.2, 0.9, 0.4, 0.3, 0.7], argmax returns 2 (the 0.9 sits at index 2, counting from 0), which corresponds to labels[2], i.e. "bussiness".
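
A quick runnable sketch of that lookup, using numpy and the label list from earlier in the thread:

```python
import numpy as np

labels = ['none', 'sport', 'bussiness', 'politics', 'tech', 'entertainment']
pred = np.array([0.0, 0.2, 0.9, 0.4, 0.3, 0.7])  # hypothetical softmax output

idx = np.argmax(pred)  # index of the largest probability: 2
print(labels[idx])     # bussiness
```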

jalbarracinv avatar Dec 27 '20 20:12 jalbarracinv


OK, it's clearer now. So you checked manually how the tokenizer converted the classes to numbers and then ordered them as ['none','sport', 'bussiness', 'politics', 'tech', 'entertainment']? Or is there a faster way to check which class is mapped to which number, so we know how to order them?

jburagev avatar Dec 27 '20 21:12 jburagev

OK, the answer for the correct label order is here. After this code:

```python
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
```

you have to add this line:

```python
label_index = label_tokenizer.word_index
```

This creates the word index for the labels, and you can then print each label with its ID using these commands:

```python
# Sort the (label, id) pairs by id and flatten them into one list
res = list(sum(sorted(label_index.items(), key=lambda x: x[1]), ()))
print(res)
```
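
A slightly more direct way to see the mapping (a sketch that just inverts the same word_index dict):

```python
# Invert word_index (label -> id) into id -> label; id 0 is the unused padding slot
id_to_label = {i: w for w, i in label_index.items()}
ordered_labels = ['none'] + [id_to_label[i] for i in sorted(id_to_label)]
print(ordered_labels)
# e.g. ['none', 'sport', 'bussiness', 'politics', 'tech', 'entertainment'];
# the exact order depends on label frequency in the training data
```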

jalbarracinv avatar Feb 08 '21 05:02 jalbarracinv

Another thing you might wonder is why, if we are training N classes, we need one more class. The answer lies in the loss function (loss='sparse_categorical_crossentropy'), where you have multiple labels but only one is the correct answer: the outputs are mutually exclusive. Because the Tokenizer assigns label IDs starting at 1, index 0 is never produced, so the output layer needs one extra class "at the left" to cover that "there is no class" slot. A softmax activation is also needed, as it suits mutually exclusive outputs: the values across all output classes sum to 1.
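
Putting that together, a minimal sketch of such a classifier head (the layer sizes are made up; the tutorial's actual model may differ):

```python
from tensorflow.keras import layers, models

num_classes = 6  # 5 real categories + the unused index 0 from the Tokenizer

model = models.Sequential([
    layers.Embedding(input_dim=1000, output_dim=16),  # hypothetical vocab/embedding sizes
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(num_classes, activation='softmax'),  # outputs sum to 1
])
model.compile(loss='sparse_categorical_crossentropy',  # expects integer labels 0..5
              optimizer='adam', metrics=['accuracy'])
```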

jalbarracinv avatar Jul 15 '21 15:07 jalbarracinv