setfit
Totally unreliable results. What am I doing wrong?
I'm evaluating SetFit to predict one of two labels using ~500 training samples across both classes, and the results are far from satisfactory.
A little background:
- I have an e-commerce website with fashion products for male and female customers in Poland (so the Polish language is used)
- There are top-level categories for both genders: Accessories, Underwear, Shoes, Clothing
- Each top-level category has multiple more specific subcategories. For instance, Clothing has Jackets, T-Shirts, Trousers, etc. These subcategories are not always identical across the gender branches; for example, high heels are present only in the female branch
- I've built a synthetic training dataset using the specific category names in singular and plural forms, e.g. "Men's T-Shirt", "Men's T-Shirts", resulting in ~500 rows (as it is fairly small, I've attached it here: synthetic-train.csv); see the sketch after this list for how it's generated
- Class counts for the training dataset: Counter({0: 333, 1: 217})
- As eval_dataset I'm using real-world products, manually selected and verified. I've attached a few testing rows here: test_sample.csv
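Roughly, the datasets are built like this (a minimal sketch: the category lists are abbreviated, the label mapping is illustrative, and I'm assuming text/label columns in the CSVs):

import pandas as pd
from datasets import Dataset

gender_labels = {"male": 0, "female": 1}  # illustrative mapping; the real one is defined in my code

# Singular/plural category names per gender branch (abbreviated here)
category_names = {
    "male": ["T-shirt męski", "T-shirty męskie", "Spodnie męskie"],
    "female": ["Sukienka", "Sukienki", "Szpilki"],
}

rows = [
    {"text": name, "label": gender_labels[gender]}
    for gender, names in category_names.items()
    for name in names
]

train_dataset = Dataset.from_pandas(pd.DataFrame(rows))
eval_dataset = Dataset.from_csv("test_sample.csv")  # real, manually verified products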
I've trained the model using sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and observed that the model is never in doubt and the predictions are totally useless. Next I tried allegro/herbert-base-cased with the following args and metrics:
from setfit import TrainingArguments

train_args = TrainingArguments(
    num_iterations=20,
    seed=42,
)
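For completeness, the rest of the training wiring looks roughly like this (the compute_metrics function is my reconstruction of how the numbers below are produced; the Trainer call itself follows the standard setfit API):

from setfit import SetFitModel, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("allegro/herbert-base-cased")

def compute_metrics(y_pred, y_test):
    # One sub-dict per metric, matching the evaluate() output format shown later
    return {
        "accuracy": {"accuracy": accuracy_score(y_test, y_pred)},
        "f1": {"f1": f1_score(y_test, y_pred)},
        "recall": {"recall": recall_score(y_test, y_pred)},
        "precision": {"precision": precision_score(y_test, y_pred)},
    }

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric=compute_metrics,
)
trainer.train()
metrics = trainer.evaluate()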
Metrics:
accuracy: 0.7290384207174697
f1: 0.5987741631305987
recall: 0.43080054274084123
precision: 0.98145285935085
Let's see how it performs:
import numpy as np

from utils import clean_text  # project helper that normalizes input text

# Invert the label -> id mapping defined earlier
class_labels = {v: k for k, v in gender_labels.items()}

# Inference
model = trainer.model
input_texts = ['lajsfhlasfuaer usiyfsf jsdfu', 'szpilki', 'sukienka', 'koszula', 'Spodnie męskie',
               'Koszula damska', "frak czarny elegancki", "buty do garnituru", "nerka"]
for input_text, pred in zip(input_texts, model.predict_proba([clean_text(t) for t in input_texts])):
    # Convert the tensor to a numpy array
    pred = pred.detach().numpy()
    # The index of the maximum probability corresponds to the predicted class
    predicted_class = np.argmax(pred)
    # The maximum probability itself
    predicted_probability = np.max(pred)
    # Map the class id back to its label
    predicted_class = class_labels[predicted_class].upper()
    # Print the input text, the predicted class, and its probability (3 decimal places)
    print(f"{clean_text(input_text)} - {predicted_class}, Probability: {predicted_probability:.3f}")
lajsfhlasfuaer usiyfsf jsdfu - FEMALE, Probability: 1.000 (BAD; random string that does not mean anything!)
szpilki - FEMALE, Probability: 1.000 (OK, high heels, typical female product)
sukienka - FEMALE, Probability: 1.000 (OK; dress; typical female product)
koszula - FEMALE, Probability: 1.000 (BAD; "Shirt" can be both male and female)
spodnie męskie - MALE, Probability: 1.000 (OK, gender is mentioned)
koszula damska - FEMALE, Probability: 1.000 (OK, gender is mentioned)
frak czarny elegancki - FEMALE, Probability: 0.983 (BAD; "black elegant tailcoat"; tailcoat is only mentioned in the male data)
buty do garnituru - FEMALE, Probability: 1.000 (BAD; "shoes for a suit"; suit is mentioned only in the male data)
nerka - MALE, Probability: 1.000 (OK; typical male product in dataset)
What am I doing wrong? Why is the model so sure of its predictions? What can I do to improve? Should the training data look like full sentences, or does punctuation not matter at all?
Thank you for your help!
I've tried to change the model head this way:
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from setfit import SetFitModel
model_id = "allegro/herbert-base-cased"
# model = SetFitModel.from_pretrained(model_id)
model_body = SentenceTransformer(model_id)
model_head = LogisticRegression(class_weight="balanced")
model = SetFitModel(model_body=model_body, model_head=model_head)
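Then I trained it exactly as before, just passing the composed model to the Trainer (a sketch; same train_args, datasets, and compute_metrics as above):

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric=compute_metrics,
)
trainer.train()
trainer.evaluate()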
and that resulted in:
{'accuracy': {'accuracy': 0.9120144343026958},
 'f1': {'f1': 0.9101744501029364},
 'recall': {'recall': 0.9497964721845319},
 'precision': {'precision': 0.8737258165175785}}
and somewhat better test results (same inputs as in the original message):
lajsfhlasfuaer usiyfsf jsdfu - MALE, Probability: 0.999
szpilki - FEMALE, Probability: 0.665
sukienka - FEMALE, Probability: 0.995
koszula - FEMALE, Probability: 0.816
spodnie męskie - FEMALE, Probability: 0.990
koszula damska - FEMALE, Probability: 0.999
frak czarny elegancki - MALE, Probability: 0.997
buty do garnituru - MALE, Probability: 0.581
nerka - MALE, Probability: 0.792
but it's still far from good.
You're getting great F1 and accuracy scores. Why is it far from being good? What am I missing?