setfit
Totally unreliable results. What am I doing wrong?
I'm evaluating SetFit to predict one of two labels using ~500 training samples across both classes, and the results are far from satisfactory.
A little background:
- I have an e-commerce website with fashion products for male and female customers in Poland (so the Polish language is used)
- There are top-level categories for both genders: Accessories, Underwear, Shoes, Clothing
- Each top-level category has multiple more specific subcategories. For instance, Clothing has Jackets, T-Shirts, Trousers, etc. These subcategories are not always identical across the gender branches; for example, high heels are present only in the female branch
- I've built a synthetic training dataset using the specific category names in singular and plural forms, e.g. "Men's T-Shirt", "Men's T-Shirts", resulting in ~500 rows (as it is fairly small, I've attached it here: synthetic-train.csv); see the sketch after this list for how it's generated
- Class counts for the training dataset: Counter({0: 333, 1: 217})
- As eval_dataset I'm using real-world products, manually selected and verified. I've attached a few testing rows here: test_sample.csv
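Roughly, the datasets are built like this (a minimal sketch: the category lists are abbreviated, the label mapping is illustrative, and I'm assuming text/label columns in the CSVs):

import pandas as pd
from datasets import Dataset

gender_labels = {"male": 0, "female": 1}  # illustrative mapping; the real one is defined in my code

# Singular/plural category names per gender branch (abbreviated here)
category_names = {
    "male": ["T-shirt męski", "T-shirty męskie", "Spodnie męskie"],
    "female": ["Sukienka", "Sukienki", "Szpilki"],
}

rows = [
    {"text": name, "label": gender_labels[gender]}
    for gender, names in category_names.items()
    for name in names
]

train_dataset = Dataset.from_pandas(pd.DataFrame(rows))
eval_dataset = Dataset.from_csv("test_sample.csv")  # real, manually verified products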
I've trained the model using sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and observed that the model is never in doubt and the predictions are totally useless. Next I tried allegro/herbert-base-cased with the following args and metrics:
from setfit import TrainingArguments

train_args = TrainingArguments(
    num_iterations=20,
    seed=42,
)
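For completeness, the rest of the training wiring looks roughly like this (the compute_metrics function is my reconstruction of how the numbers below are produced; the Trainer call itself follows the standard setfit API):

from setfit import SetFitModel, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("allegro/herbert-base-cased")

def compute_metrics(y_pred, y_test):
    # One sub-dict per metric, matching the evaluate() output format shown later
    return {
        "accuracy": {"accuracy": accuracy_score(y_test, y_pred)},
        "f1": {"f1": f1_score(y_test, y_pred)},
        "recall": {"recall": recall_score(y_test, y_pred)},
        "precision": {"precision": precision_score(y_test, y_pred)},
    }

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric=compute_metrics,
)
trainer.train()
metrics = trainer.evaluate()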
Metrics:
accuracy: 0.7290384207174697
f1: 0.5987741631305987
recall: 0.43080054274084123
precision: 0.98145285935085
Let's see how it performs:
import numpy as np

from utils import clean_text  # project helper that normalizes input text

# Invert the label -> id mapping defined earlier
class_labels = {v: k for k, v in gender_labels.items()}

# Inference
model = trainer.model
input_texts = ['lajsfhlasfuaer usiyfsf jsdfu', 'szpilki', 'sukienka', 'koszula', 'Spodnie męskie',
               'Koszula damska', "frak czarny elegancki", "buty do garnituru", "nerka"]
for input_text, pred in zip(input_texts, model.predict_proba([clean_text(t) for t in input_texts])):
    # Convert the tensor to a numpy array
    pred = pred.detach().numpy()
    # The index of the maximum probability corresponds to the predicted class
    predicted_class = np.argmax(pred)
    # The maximum probability itself
    predicted_probability = np.max(pred)
    # Map the class id back to its label
    predicted_class = class_labels[predicted_class].upper()
    # Print the input text, the predicted class, and its probability (3 decimal places)
    print(f"{clean_text(input_text)} - {predicted_class}, Probability: {predicted_probability:.3f}")
lajsfhlasfuaer usiyfsf jsdfu - FEMALE, Probability: 1.000 (BAD; random string that does not mean anything!)
szpilki - FEMALE, Probability: 1.000 (OK, high heels, typical female product)
sukienka - FEMALE, Probability: 1.000 (OK; dress; typical female product)
koszula - FEMALE, Probability: 1.000 (BAD; "Shirt" can be both male and female)
spodnie męskie - MALE, Probability: 1.000 (OK, gender is mentioned)
koszula damska - FEMALE, Probability: 1.000 (OK, gender is mentioned)
frak czarny elegancki - FEMALE, Probability: 0.983 (BAD; "black elegant tailcoat"; tailcoat is only mentioned in the male data)
buty do garnituru - FEMALE, Probability: 1.000 (BAD; "shoes for a suit"; suit is mentioned only in the male data)
nerka - MALE, Probability: 1.000 (OK; typical male product in dataset)
What am I doing wrong? Why is the model so sure of its predictions? What can I do to improve? Should the training data look like full sentences, or does punctuation not matter at all?
Thank you for your help!
I've tried to change the model head this way:
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from setfit import SetFitModel
model_id = "allegro/herbert-base-cased"
# model = SetFitModel.from_pretrained(model_id)
model_body = SentenceTransformer(model_id)
model_head = LogisticRegression(class_weight="balanced")
model = SetFitModel(model_body=model_body, model_head=model_head)
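Then I trained it exactly as before, just passing the composed model to the Trainer (a sketch; same train_args, datasets, and compute_metrics as above):

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric=compute_metrics,
)
trainer.train()
trainer.evaluate()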
and that resulted in:
{'accuracy': {'accuracy': 0.9120144343026958},
 'f1': {'f1': 0.9101744501029364},
 'recall': {'recall': 0.9497964721845319},
 'precision': {'precision': 0.8737258165175785}}
and somewhat better test results (same inputs as in the original message):
lajsfhlasfuaer usiyfsf jsdfu - MALE, Probability: 0.999
szpilki - FEMALE, Probability: 0.665
sukienka - FEMALE, Probability: 0.995
koszula - FEMALE, Probability: 0.816
spodnie męskie - FEMALE, Probability: 0.990
koszula damska - FEMALE, Probability: 0.999
frak czarny elegancki - MALE, Probability: 0.997
buty do garnituru - MALE, Probability: 0.581
nerka - MALE, Probability: 0.792
but it's still far from good.
You're getting great F1 and accuracy scores. Why is it far from being good? What am I missing?