pytorch-widedeep
problem on prediction stage
I am attempting to run the following code:
x_test = pandas.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)
predictions= trainer.predict(X_text=x_test)
where x_test consists of a single column of text descriptions,
and I am getting the following output:
predict: 75%|███████▌ | 3/4 [00:00<00:00, 12.85it/s]
/usr/local/lib/python3.7/dist-packages/pytorch_widedeep/training/trainer.py in
--> 581 return np.vstack(preds_l).squeeze(1)
ValueError: cannot select an axis to squeeze out which has size not equal to one
I am wondering how to go about resolving this. I have already tried expanding the dims of x_test and reshaping it, but I am still getting the same error.
hey @lordfiftysix
Could you point me to some code where I can reproduce the error?
I am assuming you have already trained the trainer and all that, right?
yes
Then if you could please point me to some code?
Otherwise, maybe I can try later with some dataset I might have and "report back" the results here :)
I suppose I am fine with the second option
Here is some fully functioning code:
from sklearn.model_selection import train_test_split
from pytorch_widedeep import Trainer
from pytorch_widedeep.datasets import load_womens_ecommerce
from pytorch_widedeep.models import BasicRNN, WideDeep
from pytorch_widedeep.preprocessing import TextPreprocessor
if __name__ == "__main__":
    df = load_womens_ecommerce(as_frame=True)

    # to be safe, but one can be more gentle here
    df = df.dropna().reset_index(drop=True)

    # just aesthetics
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]

    # the reviews are a bit imbalanced, so we turned the problem into a binary
    # classification
    df["target"] = (df.rating >= 4).astype("int")

    text_col = "review_text"
    target = "target"

    # train/test split
    train, test = train_test_split(df, test_size=0.2, stratify=df.target)

    # processing
    text_processor = TextPreprocessor(text_col=text_col)
    X_train = text_processor.fit_transform(train)
    X_test = text_processor.transform(test)

    # model definition. The model component needs to be wrapped up with the
    # WideDeep class
    basic_rnn = BasicRNN(
        vocab_size=len(text_processor.vocab.itos),
        embed_dim=100,
        hidden_dim=64,
        n_layers=3,
        bidirectional=True,
        rnn_dropout=0.5,
        padding_idx=1,
        head_hidden_dims=[100, 50],
    )
    model = WideDeep(deeptext=basic_rnn, pred_dim=1)

    # Train
    trainer = Trainer(model, objective="binary")
    trainer.fit(
        X_text=X_train,
        target=train[target].values,
        n_epochs=1,
        batch_size=256,
        val_split=0.2,
    )

    # predict
    preds = trainer.predict(X_text=X_test)
It did not work. I am trying to do multi-output regression. Here is some more of my code.
train_df = pd.read_csv('train.csv')
vectorizer = TfidfVectorizer(strip_accents=None,lowercase=False)
#word_vectors_path = "../tmp_data/glove.6B/glove.6B.100d.txt"
text_preprocessor = TextPreprocessor(text_col='full_text')
#print(train_df.head(0))
#tab_preprocessor = TabPreprocessor(['full_text'])
#print(x.head(0))
text_id = train_df['text_id']
#train_df = train_df.set_index('text_id')
train_df = train_df.drop(columns=['text_id'])
train_df = train_df.dropna().reset_index(drop=True)
x= text_preprocessor.fit_transform(train_df)
#tfidf_vectorizer = TfidfVectorizer()
#x = vectorizer.fit_transform(train_df['full_text'])
#x = pd.DataFrame(x.todense())
#print(x)
#x = tab_preprocessor.fit_transform(x)#
print(x)
print(x.shape)
#print(train_df.head(0))
cols = cols# this is a series of 6 columns that go into the target
#text_id = train_df['text_id']
#y = train_df.drop(['full_text'],1)
#y = y.set_index('text_id')
#tab_preprocessor = TabPreprocessor(cols,shared_embed=False)#, crossed_cols=crossed_cols)
#ywide = tab_preprocessor.fit_transform(y)
#ywidee = y['cohesion']
#tab_preprocesso = TabPreprocessor(['cohesion'])
#ywidee = tab_preprocesso.fit_transform(y)
#print(tab_preprocessor.cat_embed_input)
#print(ywide)
#fast_model = TabMlp(tab_preprocessor.column_idx,tab_preprocessor.cat_embed_input)
#fast_model = TabFastFormer(tab_preprocessor.column_idx,tab_preprocessor.cat_embed_input)#column_idx=text_id, cat_embed_input=cat_embed_input tab_preprocessor.column_idx,
#rmodel = AttentiveRNN()
rmodel = AttentiveRNN(vocab_size=5741, embed_dim=80)
#print(y.shape)
#print(tab_preprocessor.cat_embed_input)
target = target
model = WideDeep(
#wide=wide,
deeptext=rmodel,pred_dim=6
#deeptabular=fast_model,
)
wd_trainer = Trainer(
model=model,
objective='rmse',#objective="rmse",
optimizers=torch.optim.AdamW(model.parameters(), lr=0.001),
#metrics=[Accuracy, Precision]
#metrics=[Accuracy, Precision],
)
#target = target #where target is a series of 6 columns with numerical int values
#print(ywide)
print(x)
xx=x
print(train_df)
wd_trainer.fit(X_text=x, target=train_df[cols].values, n_epochs=1, batch_size=1, val_split=0.2)
x_test = pd.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)
print(x_test.shape)
x_test = x_test.reshape(80,3)
#x_test = np.expand_dims(x_test,2)
print(x_test.shape)
df_pred = wd_trainer.predict(X_text=x_test)
print(df_pred)
And I am still getting the same error at the prediction stage.
To do multi-output regression or multi-label classification we would need to modify the code.
In fact, I don't know what the RMSE value it outputs would mean in your case, since the library is designed to work with targets of shape (N, 1). As the docs put it: "Losses in this module expect the predictions and ground truth to have the same dimensions for regression and binary classification problems (N_samples, 1). In the case of multiclass classification problems the ground truth is expected to be a 1D tensor with the corresponding classes."
Anyway, if you can point me towards a notebook/colab with some small dataset or mock data, it would save me a lot of time. Otherwise I will try to mock some data myself and dig into this later.
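For anyone following along, the shape issue described above can be reproduced with NumPy alone. This is only a sketch with mock arrays: the list names and batch sizes are made up, and the only thing taken from the thread is the `np.vstack(preds_l).squeeze(1)` line in the traceback. It suggests (but does not prove) that the error comes from predictions with six columns rather than one.

```python
import numpy as np

# Mock per-batch predictions, standing in for the list the traceback calls preds_l.
single_output = [np.zeros((4, 1)), np.zeros((2, 1))]  # as with pred_dim=1
multi_output = [np.zeros((4, 6)), np.zeros((2, 6))]   # as with pred_dim=6

# With one output per sample, axis 1 has size 1 and can be squeezed away.
flat = np.vstack(single_output).squeeze(1)
print(flat.shape)

# With six outputs per sample, axis 1 has size 6, so squeeze(1) raises ValueError.
try:
    np.vstack(multi_output).squeeze(1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Reshaping or expanding the dims of the input text array would not change this, since the failing axis belongs to the model output, not to x_test.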
Hey, I wonder if you were ever able to dig into this problem. I can confirm that I have 6 columns and a few thousand rows as my output, so RMSE probably won't work. That being said, I am struggling to do multi-output regression on these 6 target columns given a single input text column.
Hey, sorry @lordfiftysix
I am buried at work these days, sorry for the late reply.
No, I did not have the time, sorry 🙁.
Maybe you could consider this as 6 independent problems and then combine the losses?
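One way to read the "6 independent problems" idea is to split the target into six (N, 1) columns, handle each one on its own, and stack the predictions back together. The sketch below uses mock data and a hypothetical `fit_and_predict` helper that just predicts the column mean; in practice that helper would be replaced by training and predicting with one single-output model per column.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.uniform(1.0, 5.0, size=(10, 6))  # mock data: 10 samples, 6 target columns


def fit_and_predict(y_column, n_test):
    """Hypothetical stand-in for training one single-output model and predicting.

    It only predicts the column mean, so the example runs without any model.
    """
    return np.full((n_test, 1), y_column.mean())


n_test = 3
per_target_preds = []
for i in range(targets.shape[1]):
    y_i = targets[:, [i]]  # indexing with a list keeps the (N, 1) shape
    per_target_preds.append(fit_and_predict(y_i, n_test))

# Put the six (n_test, 1) prediction columns side by side again.
all_preds = np.hstack(per_target_preds)
print(all_preds.shape)
```

The trade-off is six training runs instead of one, and nothing is shared between the targets.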
Alternatively, maybe you could code a custom loss yourself? Although this might not be straightforward. I'll see if I get a sec towards the end of the week. Otherwise I will see if @5uperpalo can look into it.
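As a rough idea of what such a custom loss could look like in plain PyTorch, here is a minimal sketch. The class name `MultiOutputRMSELoss` is hypothetical, and averaging the per-column RMSE is just one possible way to combine the six errors. Whether and how it plugs into the library's Trainer is not settled in this thread, and the prediction step would still need to handle six output columns.

```python
import torch
import torch.nn as nn


class MultiOutputRMSELoss(nn.Module):
    """Hypothetical loss: RMSE per target column, averaged into one scalar."""

    def forward(self, preds, target):
        # preds and target are both expected to have shape (N, n_targets)
        mse_per_column = ((preds - target) ** 2).mean(dim=0)
        # the small constant keeps the gradient finite when a column's error is exactly zero
        rmse_per_column = torch.sqrt(mse_per_column + 1e-8)
        return rmse_per_column.mean()


loss_fn = MultiOutputRMSELoss()
preds = torch.zeros(4, 6, requires_grad=True)  # mock predictions
target = torch.ones(4, 6)                      # mock ground truth
loss = loss_fn(preds, target)
loss.backward()
print(loss.item())
```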
@5uperpalo let's have a chat and see if we can code a custom loss that takes multiple inputs and produces a single output.