setfit Methodological error in zero cost, zero time, zero shot notebook

Hi,

I was looking at the zero cost, zero time, zero shot notebook for financial sentiment analysis (i.e., this one), and discovered a methodological error that invalidates the conclusions of the distillation section.

What happens is that the train and test dataframes, i.e., the CSV files loaded from Moritz Laurer's blog, are created by splitting the train split of the dataset (the dataset doesn't have a test split). Later on, when distilling, the authors of the blog post reload the entire train split of the dataset and use it to distill the MLP. This means that the test data is also used to distill the model, which leads to a large overestimate of performance.
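A minimal sketch of the fix described here, assuming each example is a dict and that a `text` field identifies a row (the field name, the toy rows, and the helper `drop_test_rows` are all hypothetical, not taken from the notebook): drop every held-out test row from the reloaded train split before using it as distillation data.

```python
def drop_test_rows(pool, test_rows, key="text"):
    """Return the rows of `pool` whose `key` value does not occur in `test_rows`."""
    # Build a set once so membership checks are O(1) per row.
    held_out = {row[key] for row in test_rows}
    return [row for row in pool if row[key] not in held_out]


# Toy stand-in for the reloaded, full train split (hypothetical data).
full_train = [
    {"text": "Profit rose 10%.", "label": "positive"},
    {"text": "Sales were flat.", "label": "neutral"},
    {"text": "Losses widened.", "label": "negative"},
]
# Toy stand-in for the test dataframe carved out of that same split.
test_rows = [{"text": "Sales were flat.", "label": "neutral"}]

distill_pool = drop_test_rows(full_train, test_rows)

# Sanity check: no test sentence may remain in the distillation pool.
leaked = {row["text"] for row in distill_pool} & {row["text"] for row in test_rows}
assert not leaked, f"{len(leaked)} test rows leaked into the distillation data"
```

With pandas dataframes, the same filter could be written as `full[~full["text"].isin(test["text"])]`, again assuming a `text` column. Matching on the raw text also removes duplicate sentences that appear in both sets, which is usually the desired behaviour when checking for leakage.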

In my experiments, the original PRF scores I got (per-class precision, recall, F1, and support, one array each) were:

(array([0.85507246, 0.97348485, 0.94166667]),
 array([0.96721311, 0.96981132, 0.88976378]),
 array([0.90769231, 0.97164461, 0.91497976]),
 array([ 61, 265, 127]))

This is close to the score reported in the article. If I instead remove the test data from the data used to distill the MLP, I get much lower scores:

(array([0.76785714, 0.87632509, 0.78947368]),
 array([0.70491803, 0.93584906, 0.70866142]),
 array([0.73504274, 0.90510949, 0.74688797]),
 array([ 61, 265, 127]))

These scores are much lower than the reported scores, and also much lower than the LLM scores, which invalidates the conclusion of the notebook and article. Note that these scores are still a bit higher than the scores you would get when just directly optimizing cross entropy, so you could argue that the point still makes sense.
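For anyone reproducing the comparison, here is a minimal pure-Python sketch of how per-class precision, recall, F1, and support (the four arrays in each tuple above) can be computed. The tuple layout looks like what scikit-learn's `precision_recall_fscore_support` returns, but that is an assumption; the function `per_class_prf` and the toy labels below are hypothetical.

```python
def per_class_prf(y_true, y_pred, labels):
    """Return per-class (precision, recall, f1, support) lists, ordered as `labels`."""
    precision, recall, f1, support = [], [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precision.append(prec)
        recall.append(rec)
        # F1 is the harmonic mean of precision and recall.
        f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        # Support is the number of true examples of this class.
        support.append(tp + fn)
    return precision, recall, f1, support


# Toy labels (hypothetical); the real evaluation uses the held-out test set.
y_true = ["neg", "neu", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neu", "pos", "pos", "pos", "neu"]
precision, recall, f1, support = per_class_prf(y_true, y_pred, ["neg", "neu", "pos"])
```

The support array depends only on the true labels, which is why it is identical (`[61, 265, 127]`) in both runs above: the test set is the same, and only the distillation data changed.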

If you want I can do a PR on the notebook.

Apr 20 '24 13:04 stephantul

@MosheWasserb

Apr 20 '24 18:04 tomaarsen

Hi @tomaarsen, sorry, I missed your message :( Great catch. Yes, go ahead and open a PR.

May 28 '24 07:05 MosheWasserb

Hey @MosheWasserb,

Thanks for replying, really appreciated.

Before I submit a PR, could we maybe discuss what you want the final conclusion of the article to look like? Because the part after you reload the dataset doesn't work any more. Should I just remove those parts?

Jun 13 '24 18:06 stephantul