
'TextPreprocessor' object has no attribute 'embedding_matrix'

Open ajkdrag opened this issue 1 year ago • 6 comments

I was trying to use the text preprocessor as shown in the docs/examples, but I get this error. I couldn't find this attribute anywhere in the TextPreprocessor class.

ajkdrag avatar Feb 20 '24 15:02 ajkdrag

Hi @ajkdrag, which example? I will have a look :)

jrzaurin avatar Feb 20 '24 15:02 jrzaurin

Hi @ajkdrag

It occurs to me that to use an embedding matrix you need to have it stored on disk. Some of the examples in the library use the GloVe vectors. You would need to have them on your machine and then pass them to the TextPreprocessor via the word_vectors_path param.
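For reference, a minimal sketch of what that looks like. The toy dataframe, the column name, and the GloVe file path are placeholders (adjust to your machine); the point is that `embedding_matrix` is only populated after fitting with `word_vectors_path` set:

```python
import pandas as pd

from pytorch_widedeep.preprocessing import TextPreprocessor

# toy data with a single free-text column (placeholder)
df = pd.DataFrame({"payee": ["Mr. John Doe", "pay to the order of", "Nike Association"]})

# path to GloVe vectors previously downloaded to disk (adjust to your setup)
glove_path = "data/glove.6B/glove.6B.100d.txt"

text_preprocessor = TextPreprocessor(
    text_col="payee",
    word_vectors_path=glove_path,  # without this, embedding_matrix is never built
    maxlen=10,
    min_freq=1,
)
X_text = text_preprocessor.fit_transform(df)

# the attribute exists only after fitting with word_vectors_path set
print(text_preprocessor.embedding_matrix.shape)
```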

Hope this helps!

jrzaurin avatar Feb 21 '24 08:02 jrzaurin

You are correct. I was trying to use the BasicRNN without any pretrained embeddings.

Also, a quick suggestion from you on my use case would help. I am trying to run a binary classification where I have a few continuous features and a text column, Payee. The Payee column can contain strings like "Mr. John Doe", "Nike Association", or "Oakland LIP" (entity names, basically payee names on a bank cheque), for which the corresponding target is 1. On the other hand, that column also contains strings like "fifty dollars", "pay to the order of", "p.o. box 1100/33", or "Bank of panorama", which are not valid payee names and have a corresponding target of 0. Inherently this is like a text classification/entity classification task with additional non-text features.

My questions are: is it okay to use GloVe embeddings for this task? Should I fine-tune the embeddings on my dataset (300 rows is all I have)? Or should I use the BasicRNN with learnable embeddings?

P.S. Apologies for the long text, but any suggestion would help.

ajkdrag avatar Feb 21 '24 09:02 ajkdrag

Hi @ajkdrag

Well... in my view I would suggest building features out of the text column, such as:

- number of tokens
- number of digits
- number of letters
- number of non-alphanumeric characters
- presence of certain tokens / character associations
- etc.

and building a tabular dataset out of these plus the other columns, which I would then plug into a GBM of your liking (CatBoost, LightGBM, or XGBoost), and perhaps forget about DL for this problem :)
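To make that concrete, here is a minimal sketch of that kind of feature engineering feeding a GBM. The toy data, the column names, the specific features, and the choice of LightGBM are all illustrative, not something prescribed above:

```python
import re

import pandas as pd
from lightgbm import LGBMClassifier  # any GBM works: CatBoost, XGBoost, ...

def text_features(s: str) -> dict:
    # simple hand-crafted features along the lines suggested above
    return {
        "n_tokens": len(s.split()),
        "n_digits": sum(c.isdigit() for c in s),
        "n_letters": sum(c.isalpha() for c in s),
        "n_non_alnum": sum((not c.isalnum()) and (not c.isspace()) for c in s),
        "has_pay_token": int(bool(re.search(r"\bpay\b", s.lower()))),
    }

# hypothetical data: text column, one continuous feature, binary target
df = pd.DataFrame(
    {
        "payee": ["Mr. John Doe", "pay to the order of", "Nike Association", "fifty dollars"],
        "amount": [120.0, 50.0, 980.0, 50.0],
        "target": [1, 0, 1, 0],
    }
)

# tabular dataset: engineered text features + the other columns
feats = pd.DataFrame([text_features(s) for s in df["payee"]])
X = pd.concat([feats, df[["amount"]]], axis=1)

clf = LGBMClassifier(n_estimators=100)
clf.fit(X, df["target"])
```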

If you choose to use DL, I would suggest that you tokenize at the character level (and/or use char bigrams and trigrams) and use an RNN with learnable embeddings.
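A minimal sketch of that idea in plain PyTorch (not the pytorch-widedeep API): the toy texts, vocabulary handling, and model sizes are placeholders, but it shows character-level ids feeding a learnable embedding layer plus an LSTM:

```python
import torch
import torch.nn as nn

# character-level tokenization: map each character to an integer id
texts = ["Mr. John Doe", "pay to the order of"]
chars = sorted({c for t in texts for c in t.lower()})
stoi = {c: i + 1 for i, c in enumerate(chars)}  # id 0 reserved for padding

def encode(t: str, maxlen: int = 32) -> torch.Tensor:
    # unknown characters fall back to the padding id in this toy setup
    ids = [stoi.get(c, 0) for c in t.lower()][:maxlen]
    return torch.tensor(ids + [0] * (maxlen - len(ids)))

class CharRNN(nn.Module):
    # learnable char embeddings + LSTM; last hidden state feeds a binary classifier
    def __init__(self, vocab_size: int, embed_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        _, (h, _) = self.rnn(self.emb(x))
        return self.out(h[-1])

model = CharRNN(vocab_size=len(stoi) + 1)
batch = torch.stack([encode(t) for t in texts])
logits = model(batch)  # shape (batch, 1): binary-classification logits
```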

If I were you I would start with the first option :)

jrzaurin avatar Feb 21 '24 14:02 jrzaurin

Is there a way to use a GBM within the wide-deep framework?

ajkdrag avatar Feb 21 '24 14:02 ajkdrag

What I was suggesting was to NOT use deep learning at all! :)

Just "standard" feature engineering and a GBM.

jrzaurin avatar Feb 22 '24 17:02 jrzaurin