pytorch-widedeep
'TextPreprocessor' object has no attribute 'embedding_matrix'
I was trying to use the text preprocessor as shown in the docs/examples, but I get this error. I couldn't find this attribute in the TextPreprocessor class.
Hi @ajkdrag, which example? I will have a look :)
Hi @ajkdrag
It occurs to me that to use an embedding matrix you need to have it stored on disk. Some of the examples in the library use the GloVe vectors. You would need to have them on your machine and then pass the path to the TextPreprocessor via the word_vectors_path param.
Hope this helps!
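For anyone hitting the same error, a minimal sketch of what the fix looks like. The GloVe file path, the toy dataframe and the column name are made up for illustration; the parameter names (text_col, min_freq, maxlen, word_vectors_path) follow the discussion above, so check them against the version of pytorch-widedeep you have installed:

```python
import pandas as pd
from pytorch_widedeep.preprocessing import TextPreprocessor

# toy data standing in for the real text column
df = pd.DataFrame({"payee": ["Mr. John Doe", "pay to the order of"]})

# Pointing word_vectors_path at a local GloVe file is what makes the
# preprocessor build an embedding_matrix during fit(); without it the
# attribute is never created, which is the error reported above.
text_preprocessor = TextPreprocessor(
    text_col="payee",
    min_freq=1,
    maxlen=10,
    word_vectors_path="glove.6B.100d.txt",  # must exist on disk
)
X_text = text_preprocessor.fit_transform(df)
print(text_preprocessor.embedding_matrix.shape)
```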
You are correct. I was trying to use the base RNN without any pretrained embeddings. Also, a quick suggestion from you on my use case would help. I am trying to run a binary classification where I have a few continuous features and a text column, Payee. The Payee column can contain strings like "Mr. John Doe", "Nike Association", "Oakland LIP" (entity names, basically payee names on a bank cheque), for which the corresponding target is 1. On the other hand, there are also strings in that column like "fifty dollars", "pay to the order of", "p.o. box 1100/33", "Bank of panorama", etc., which are not valid payee names and have a corresponding target of 0. Inherently this is like a text classification/entity classification task with additional non-text features. My question is: is it okay to use GloVe embeddings for this task? Should I tune the embeddings on my dataset (300 rows is all I have)? Or should I use the base RNN with learnable embeddings? P.S. Apologies for the long text, but any suggestion would help.
Hi @ajkdrag
Well... in my view I would suggest building features out of the text column, such as:
- number of tokens
- number of numbers
- number of letters
- number of non-alphanumeric characters
- presence of some token/char association, etc.
and build tabular data out of these plus the other columns, which I would then suggest plugging into a GBM of your liking (CatBoost, LightGBM or XGBoost), and perhaps forget about DL for this problem :)
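Roughly what that first option could look like. This is only a sketch: the column names, toy rows, engineered features and the title-token regex are invented for illustration, and LightGBM is just one of the GBMs mentioned above:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# toy data: a text column, one continuous feature, and the binary target
df = pd.DataFrame({
    "payee": ["Mr. John Doe", "Nike Association", "fifty dollars", "p.o. box 1100/33"],
    "amount": [120.0, 54.3, 50.0, 10.0],
    "target": [1, 1, 0, 0],
})

def text_features(s: pd.Series) -> pd.DataFrame:
    """Simple hand-crafted features computed from the text column."""
    return pd.DataFrame({
        "n_tokens": s.str.split().str.len(),
        "n_digits": s.str.count(r"\d"),
        "n_letters": s.str.count(r"[A-Za-z]"),
        "n_non_alnum": s.str.count(r"[^A-Za-z0-9 ]"),
        "has_title": s.str.contains(r"\b(?:Mr|Mrs|Ms|Dr)\.?\b").astype(int),
    })

# tabular data = engineered text features + the other (continuous) columns
X = pd.concat([text_features(df["payee"]), df[["amount"]]], axis=1)
y = df["target"]

clf = LGBMClassifier(n_estimators=100)
clf.fit(X, y)
```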
If you choose to use DL, I would suggest that you tokenize at character level (and/or char bigrams and trigrams) and use an RNN with learnable embeddings.
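And a minimal sketch of that second option in plain PyTorch (not pytorch-widedeep's own text models); the class name, toy vocabulary and hyperparameters below are made up for illustration:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Character-level tokens -> learnable embeddings -> LSTM -> binary logit."""
    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # learnable character embeddings, no pretrained vectors needed
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                     # x: (batch, seq_len) of char indices
        emb = self.embed(x)
        _, (h, _) = self.rnn(emb)
        return self.out(h[-1]).squeeze(-1)    # logits for the binary target

# toy usage: encode "Mr. John Doe" as character indices (0 reserved for padding)
text = "Mr. John Doe"
vocab = {c: i + 1 for i, c in enumerate(sorted(set(text)))}
x = torch.tensor([[vocab[c] for c in text]])
model = CharRNN(vocab_size=len(vocab) + 1)
logits = model(x)
```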
If I were you I would start with the first option :)
Is there a way to use a GBM with the wide-deep framework?
What I was suggesting was to NOT use Deep Learning at all! :)
"standard" feature engineering and a GBM