text2vec
Multiple Errors Adapting GloVe Example to Project - Quanteda Related
Hello there,
I am having what I believe are multiple issues adapting the GloVe word-embeddings tutorial to my project. I am starting from a tokens object created in Quanteda (TOK.Debates.2020.Full.Clean) to build the iterator. However, when I run that first line, I am greeted with this warning:
Tokenizer_Debates_2020 = space_tokenizer(TOK.Debates.2020.Full.Clean)
Warning message: In stringi::stri_split_fixed(strings, pattern = sep, ...) :
  argument is not an atomic vector; coercing
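That coercion warning is consistent with space_tokenizer() receiving a list rather than a character vector: a quanteda tokens object behaves like a list of character vectors (one per document), while space_tokenizer() expects one plain string per document. A minimal sketch of the mismatch, using made-up documents (the names and contents here are hypothetical):

```r
# A tokens-like object: a list of character vectors, one per document
docs_as_list <- list(
  doc1 = c("we", "need", "healthcare"),
  doc2 = c("jobs", "and", "growth")
)

# Passing the list straight to a string splitter forces a coercion (hence
# the warning). Collapsing each document back into a single string first
# gives a space tokenizer the input shape it expects:
docs_as_text <- vapply(docs_as_list, paste, character(1), collapse = " ")
retokenized  <- strsplit(docs_as_text, " ", fixed = TRUE)
```

After this round-trip, `retokenized` is again one character vector of tokens per document, which is the structure the rest of the pipeline assumes.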
The tokenizer is created and looks like this:
I continue the example with no errors:
Iterator_Debates_2020 = itoken(Tokenizer_Debates_2020)
Vocab_Debates_2020 = create_vocabulary(Iterator_Debates_2020)
Vocab_Debates_2020 = prune_vocabulary(Vocab_Debates_2020, term_count_min = 10L)
Vectorizer_Debates_2020 = vocab_vectorizer(Vocab_Debates_2020)
TCM_Debates_2020 = create_tcm(Iterator_Debates_2020, Vectorizer_Debates_2020, skip_grams_window = 5L)
I check the dimensions of the TCM and see that I have rows and columns:
dim(TCM_Debates_2020)
[1] 9277 9277
I start to fit the model. The GlobalVectors object is created with no issue, but when I try to do the actual fitting I obtain the following error:
glove = GlobalVectors$new(rank = 50, x_max = 10)
WV_Debates_2020 = glove$fit_transform(TCM_Debates_2020, n_iter = 10, convergence_tol = 0.01, n_threads = 8)
Error in if (cost/n_nnz > 1) stop("Cost is too big, probably something goes wrong... try smaller learning rate") :
  missing value where TRUE/FALSE needed
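"missing value where TRUE/FALSE needed" means `cost/n_nnz` evaluated to NA inside that `if`. One plausible cause (an assumption, possibly downstream of the tokenization problem above) is a non-finite value lurking in the co-occurrence matrix. A quick sanity check, sketched here on a toy sparse matrix from the Matrix package that ships with R:

```r
library(Matrix)

# Toy TCM with a NaN entry standing in for a corrupted co-occurrence count
tcm <- sparseMatrix(i = c(1, 2), j = c(2, 1), x = c(3, NaN), dims = c(2, 2))

# If this is TRUE for the real TCM, the GloVe cost becomes NA and the
# comparison in the error message fails exactly as shown:
bad_values <- any(!is.finite(tcm@x))
```

Running `any(!is.finite(TCM_Debates_2020@x))` on the real matrix would confirm or rule this out.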
In order to troubleshoot this error, I have tried the following:
- Lowered the learning rate passed to GlobalVectors$new() to 0.001; I still receive the same cost error
- Wrote the initial tokens object out to a text file to match the tutorial input more closely; I still receive the same coercion warning
- Attempted to use a Quanteda FCM in place of the TCM, but receive the following error:
WV_Debates_2020 = glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01, n_threads = 8)
Error in glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01, :
  all(x@x > 0) is not TRUE
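The `all(x@x > 0) is not TRUE` check means every *stored* value of the input sparse matrix must be strictly positive, and a matrix can carry explicitly stored zeros that trip it. One hedged possibility is cleaning those out with Matrix::drop0() before fitting; a sketch on a toy matrix (whether the real fcm contains stored zeros is an assumption to verify):

```r
library(Matrix)

# Toy co-occurrence matrix; overwrite one stored value with an explicit
# zero to simulate what a problematic input might contain
m <- sparseMatrix(i = c(1, 2), j = c(2, 1), x = c(5, 7), dims = c(2, 2))
m@x[2] <- 0

zero_present <- !all(m@x > 0)  # TRUE: the stored zero would trip the check

m_clean <- drop0(m)            # remove explicitly stored zeros
now_ok <- all(m_clean@x > 0)   # TRUE
```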
I have been unable to proceed further. One or more of these errors is presumably the culprit, but I have been unable to find documentation on them elsewhere, including in past issues catalogued here.
Thank you in advance for any help in taking out this gremlin. -Sello
Your Tokenizer_Debates_2020 looks like a list of words instead of a list of sequences of words.
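The distinction can be sketched with made-up documents: itoken() wants one character vector of tokens per document, whereas a mis-tokenized input collapses into a flat list where every element is a single word.

```r
# A list of *sequences* of words -- one character vector per document:
sequences <- list(doc1 = c("health", "care", "reform"),
                  doc2 = c("tax", "policy"))

# A flat list of *words* -- each element is a single token, so every
# "document" has length one:
words <- as.list(unlist(sequences))

lengths(sequences)  # 3 2  -> sequences of words
lengths(words)      # 1 1 1 1 1 -> individual words
```

With the flat form, the vocabulary and TCM get built as if every token were its own one-word document, which breaks the downstream co-occurrence counts.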
@jwijffels thank you for pointing that out, I've been trying to understand what the difference is, but I'm coming up short, unfortunately.
Would I avoid this problem if I tokenized the original corpus instead of a cleaned tokens item?
Did you try Iterator_Debates_2020 = itoken(TOK.Debates.2020.Full.Clean, tokenizer = space_tokenizer)?