
problem on prediction stage

Open lordfiftysix opened this issue 3 years ago • 9 comments

I am attempting to run the following code:

x_test = pandas.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)
predictions = trainer.predict(X_text=x_test)

where x_test consists of a single column with text descriptions

and I am getting the following output

predict: 75%|███████▌ | 3/4 [00:00<00:00, 12.85it/s]

/usr/local/lib/python3.7/dist-packages/pytorch_widedeep/training/trainer.py in
--> 581     return np.vstack(preds_l).squeeze(1)

ValueError: cannot select an axis to squeeze out which has size not equal to one

I am wondering how to go about resolving this. I have already tried expanding the dims of x_test and resizing it, but I am still getting the same issue.
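For reference, the failing call is plain numpy behaviour and is easy to reproduce outside the library whenever the stacked predictions have more than one column (a minimal standalone illustration, not pytorch-widedeep code):

import numpy as np

# any per-batch predictions with more than one column trigger the same error
batch_preds = [np.zeros((4, 3)), np.zeros((4, 3))]
np.vstack(batch_preds).squeeze(1)
# ValueError: cannot select an axis to squeeze out which has size not equal to one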

lordfiftysix avatar Oct 19 '22 07:10 lordfiftysix

hey @lordfiftysix

Could you point me to some code where I can reproduce the error?

I am assuming you have trained the trainer and all that, right?

jrzaurin avatar Oct 19 '22 07:10 jrzaurin

yes

lordfiftysix avatar Oct 19 '22 07:10 lordfiftysix

Then if you could please point me to some code?

Otherwise, maybe I can try later with some dataset I might have and "report back" the results here :)

jrzaurin avatar Oct 19 '22 07:10 jrzaurin

I suppose I am fine with the second option

lordfiftysix avatar Oct 19 '22 07:10 lordfiftysix

here is some fully functioning code

from sklearn.model_selection import train_test_split

from pytorch_widedeep import Trainer
from pytorch_widedeep.datasets import load_womens_ecommerce
from pytorch_widedeep.models import BasicRNN, WideDeep
from pytorch_widedeep.preprocessing import TextPreprocessor

if __name__ == "__main__":

    df = load_womens_ecommerce(as_frame=True)

    # to be safe, but one can be more gentle here
    df = df.dropna().reset_index(drop=True)

    # just aesthetics
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]

    # the reviews are a bit imbalanced, so we turned the problem into a binary
    # classification
    df["target"] = (df.rating >= 4).astype("int")
    text_col = "review_text"
    target = "target"

    # train/test split
    train, test = train_test_split(df, test_size=0.2, stratify=df.target)

    # processing
    text_processor = TextPreprocessor(text_col=text_col)
    X_train = text_processor.fit_transform(train)
    X_test = text_processor.transform(test)

    # model definition. The model component needs to be wrapped up with the
    # WideDeep class
    basic_rnn = BasicRNN(
        vocab_size=len(text_processor.vocab.itos),
        embed_dim=100,
        hidden_dim=64,
        n_layers=3,
        bidirectional=True,
        rnn_dropout=0.5,
        padding_idx=1,
        head_hidden_dims=[100, 50],
    )
    model = WideDeep(deeptext=basic_rnn, pred_dim=1)

    # Train
    trainer = Trainer(model, objective="binary")
    trainer.fit(
        X_text=X_train,
        target=train[target].values,
        n_epochs=1,
        batch_size=256,
        val_split=0.2,
    )

    # predict
    preds = trainer.predict(X_text=X_test)

jrzaurin avatar Oct 19 '22 14:10 jrzaurin

It did not work. I am trying to do multi-output regression. Here is some more of my code.

import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import AttentiveRNN, WideDeep
from pytorch_widedeep.preprocessing import TextPreprocessor

train_df = pd.read_csv('train.csv')

text_preprocessor = TextPreprocessor(text_col='full_text')

text_id = train_df['text_id']
train_df = train_df.drop(columns=['text_id'])
train_df = train_df.dropna().reset_index(drop=True)
x = text_preprocessor.fit_transform(train_df)

print(x)
print(x.shape)

cols = cols  # a list with the 6 numerical target columns

rmodel = AttentiveRNN(vocab_size=5741, embed_dim=80)
model = WideDeep(deeptext=rmodel, pred_dim=6)

wd_trainer = Trainer(
    model=model,
    objective='rmse',
    optimizers=torch.optim.AdamW(model.parameters(), lr=0.001),
)

wd_trainer.fit(X_text=x, target=train_df[cols].values, n_epochs=1, batch_size=1, val_split=0.2)

x_test = pd.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)

print(x_test.shape)
x_test = x_test.reshape(80, 3)  # also tried expanding dims instead; same error either way
print(x_test.shape)

df_pred = wd_trainer.predict(X_text=x_test)
print(df_pred)

And I am still getting the same error on the prediction stage

lordfiftysix avatar Oct 22 '22 06:10 lordfiftysix

To do multi-output regression or multi-label classification we would need to modify the code.

In fact, I don't know what the rmse value it outputs might mean in your case, since the library is designed to work with targets of shape (N, 1), as it is written in the docs: "Losses in this module expect the predictions and ground truth to have the same dimensions for regression and binary classification problems (N_samples, 1). In the case of multiclass classification problems the ground truth is expected to be a 1D tensor with the corresponding classes."
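To make that quote a bit more concrete, this is roughly what those conventions look like (just a schematic numpy illustration, not library code):

import numpy as np

n_samples = 100

# regression / binary: pred_dim=1 and a target of shape (n_samples,) or (n_samples, 1)
y_regression = np.random.rand(n_samples)
y_binary = np.random.randint(0, 2, n_samples)

# multiclass: pred_dim=n_classes and a 1D target of class indices
y_multiclass = np.random.randint(0, 4, n_samples)

# a (n_samples, 6) target, as in the multi-output case above, does not fit any of
# these conventions, which is why predict() ends up squeezing an axis of size 6 and fails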

Anyway, if you can point me towards a notebook/colab with some small dataset or mock data, that would save me a lot of time. Otherwise I will try to mock some data myself and dig into this later.

jrzaurin avatar Oct 24 '22 08:10 jrzaurin

Hey, I wonder if you were ever able to dig into this problem. I can confirm that I have 6 columns and a few thousand rows as my output, so RMSE probably won't work. That being said, I am struggling to do multi-output regression on these 6 target columns given a single input text column.

lordfiftysix avatar Nov 07 '22 04:11 lordfiftysix

Hey, sorry @lordfiftysix

I am buried at work these days, sorry for the late reply.

No I did not have the time sorry 🙁.

Maybe you could consider this as 6 independent problems and then combine the losses?
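Something along those lines could be sketched by reusing the variables from your snippet above (x, x_test, train_df, cols and text_preprocessor); a rough, untested sketch:

import numpy as np

# x, x_test: the arrays produced by text_preprocessor (no reshaping needed)
preds_per_target = []
for col in cols:  # the 6 numerical target columns
    rnn = AttentiveRNN(vocab_size=len(text_preprocessor.vocab.itos), embed_dim=80)
    model = WideDeep(deeptext=rnn, pred_dim=1)
    trainer = Trainer(model, objective="regression")
    trainer.fit(X_text=x, target=train_df[col].values, n_epochs=1, batch_size=256)
    preds_per_target.append(trainer.predict(X_text=x_test))

preds = np.column_stack(preds_per_target)  # shape: (n_test_samples, 6)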

Alternatively, maybe you could code a custom loss yourself? Although this might not be straightforward. I'll see if I get a sec towards the end of the week. Otherwise, I will see if @5uperpalo can look into it.

@5uperpalo let's have a chat and see if we can code a custom loss that takes multiple inputs and produces a single output.
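For the record, the loss itself could be something as simple as the sketch below (MultiOutputRMSELoss is just a hypothetical name; wiring it into the Trainer would still require changing the (N, 1) assumptions in the training and prediction code mentioned above):

import torch
import torch.nn as nn

class MultiOutputRMSELoss(nn.Module):
    # averages the per-column RMSE over the target columns, returning a single scalar
    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        # y_pred, y_true: tensors of shape (batch_size, n_targets)
        mse_per_col = torch.mean((y_pred - y_true.float()) ** 2, dim=0)
        return torch.sqrt(mse_per_col).mean()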

jrzaurin avatar Nov 14 '22 11:11 jrzaurin

Hey @lordfiftysix

Better late than never, check this release

:)

jrzaurin avatar Jun 15 '24 15:06 jrzaurin