
Zero-Shot Learning in Modern NLP | Joe Davison Blog

utterances-bot opened this issue 5 years ago • 42 comments

Zero-Shot Learning in Modern NLP | Joe Davison Blog

State-of-the-art NLP models for text classification without annotated data

https://joeddav.github.io/blog/2020/05/29/ZSL.html

utterances-bot avatar Jul 02 '20 20:07 utterances-bot

Nice blog - I only had time to skim through the high level of each method. Which method does the transformers pipeline use?

aced125 avatar Aug 11 '20 21:08 aced125

Nice blog - I only had time to skim through the high level of each method. Which method does the transformers pipeline use?

Thanks! The pipeline uses the NLI method.

joeddav avatar Aug 11 '20 22:08 joeddav
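For anyone curious what the NLI method looks like under the hood, here is a minimal sketch in pure Python (the entailment logits are made-up numbers standing in for a real model's outputs): each candidate label is turned into a hypothesis such as "This example is about {label}.", the NLI model scores entailment for each (text, hypothesis) pair, and the entailment scores are softmaxed across the labels.

```python
import math

def zero_shot_scores(entailment_logits, labels):
    """Softmax entailment logits across candidate labels.

    In the real pipeline, each logit comes from running the
    (text, "This example is about {label}.") pair through an
    NLI model and taking the entailment logit.
    """
    exps = [math.exp(z) for z in entailment_logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

labels = ["politics", "sports", "technology"]
logits = [2.1, -0.3, 0.5]  # dummy entailment logits, not from a real model
scores = zero_shot_scores(logits, labels)
best = max(scores, key=scores.get)  # highest-scoring label
```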

This article is brilliantly written!

dshahrokhian avatar Aug 17 '20 23:08 dshahrokhian

Thank you, perfect article. Could you please suggest the most suitable way to classify a text (containing N sentences) to an expected label?

yurilla56 avatar Aug 21 '20 13:08 yurilla56

Thank you, amazing work. Can I see the code behind your online demo please?

hishamkhrayzat51 avatar Aug 26 '20 15:08 hishamkhrayzat51

@hishamkhrayzat51 Yeah the repo is here.

joeddav avatar Aug 26 '20 16:08 joeddav

Hello, I'd like to know how many GPUs your API for zero-shot topic classification is running on, because when I try to scan a 50-sentence text with 10 topics on Colab, it takes approximately 5 minutes per text. It looks like it's way faster on your web API, though.

Thank you for your answer,

Clotilde

clotildemiura avatar Oct 08 '20 09:10 clotildemiura

@clotildemiura It's slow if you're not on GPU since you have to run each text/candidate label pair through the model separately. If the web API is significantly faster, it's probably just because the results for examples you're looking at are cached. The web API is also just using CPU.

A few tips for speeding up the pipeline here.

joeddav avatar Oct 08 '20 15:10 joeddav
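To make the cost concrete (a sketch, with the hypothesis template and helper names being illustrative, not from the pipeline internals): every text/label combination is a separate forward pass, so 50 texts with 10 labels means 500 passes, and batching them is what makes a GPU pay off.

```python
def make_pairs(texts, labels):
    # Every (text, label) combination is a separate NLI input,
    # so 50 texts x 10 labels = 500 forward passes.
    return [(t, f"This example is about {l}.") for t in texts for l in labels]

def batches(pairs, batch_size=32):
    # Running pairs in batches amortizes per-pass overhead on GPU.
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]

pairs = make_pairs(["some text"] * 50, ["topic"] * 10)
n_batches = sum(1 for _ in batches(pairs, 32))  # 500 pairs in batches of 32
```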

thank you very much @joeddav

clotildemiura avatar Oct 09 '20 08:10 clotildemiura

This is very interesting.

I had read two other papers on zero-shot learning some time ago. The key ideas were:

  1. Training a binary classifier to predict whether a (text, label) pair matches or not: (paper, summary)

  2. Training GPT-2 to generate the class given a multiple-choice question answer as a prompt: (paper, summary)

amitness avatar Oct 10 '20 14:10 amitness

Really great article Joe! This will especially work for English text, right? What would you advise for non-English languages that don't have MNLI datasets or NLI-trained BERT models?

gevezex avatar Oct 12 '20 10:10 gevezex

@gevezex Yep, I actually trained a model on a multilingual NLI dataset for this exact purpose! Tweet here: https://twitter.com/joeddav/status/1298997753075232772

joeddav avatar Oct 13 '20 13:10 joeddav

Hey Joe, great article!

I have a silly question about this in the few-shot learning for the embedding approaches:

Take the top K most frequent words V in the vocabulary of a word2vec model

By the top K most frequent words, do you mean the top K from the corpus you are trying to classify?

Thanks for the multilingual NLI, btw!

agombert avatar Oct 26 '20 18:10 agombert

@agombert Glad you enjoyed it! Sorry, this was difficult to communicate. The format of word vector files typically orders the words by inverse frequency in the algorithm's train corpus. I meant the top K according to that ordering. So if you have a .vec file with 100k words (lines), just use the first K.

joeddav avatar Oct 26 '20 18:10 joeddav
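A minimal sketch of that "first K lines" idea, using a toy in-memory .vec file (the "&lt;n_words&gt; &lt;dim&gt;" header line is typical of fastText-style files; the toy words and vectors here are invented):

```python
import io

def top_k_words(vec_file, k):
    """Read the first k word vectors from a .vec-style file.

    Files in this format usually begin with a "<n_words> <dim>" header
    and list words in descending frequency order, so the first k lines
    after the header are the k most frequent words.
    """
    vec_file.readline()  # skip the header, e.g. "100000 300"
    words = {}
    for _ in range(k):
        parts = vec_file.readline().split()
        words[parts[0]] = [float(x) for x in parts[1:]]
    return words

toy = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "of 0.0 0.1 0.2\n"
    "and 0.2 0.2 0.2\n"
    "zebra 0.9 0.9 0.9\n"
)
vecs = top_k_words(toy, 2)  # the two most frequent words
```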

I'm wondering what happens if I use bigrams in candidate labels, e.g. candidate_labels = ["not sustainable", "climate change", "environment pollution", "government state policy", "finance bank"]. Will these work? I think bigrams could add more context.

dlmwright avatar Oct 26 '20 20:10 dlmwright

Fantastic article!

Just a minor fix: the model name in the last code snippet should be facebook/bart-large-mnli.

elderpinzon avatar Oct 30 '20 18:10 elderpinzon

Fascinating article, Joe. Is there any resource available on how to fine-tune such models with our own data? Thanks

kk2211 avatar Nov 05 '20 19:11 kk2211

Really great article, keep it up!

sidharkal avatar Nov 24 '20 03:11 sidharkal

Hi Joe, thanks for your article!! Is it possible to fine-tune these models?

mtortoli avatar Dec 29 '20 10:12 mtortoli

@joeddav thanks for the article. I find it very helpful.

Do you happen to have the notebook/code available for mapping from S-BERT to word2vec? I wonder how it is done, and also how you generate the word2vec embedding for phrases such as "Science and Mathematics". 🤔

jackxxu avatar Jan 17 '21 20:01 jackxxu

Hi, thanks a lot for the article and notebook. Just a quick question: is the default model in the pipeline BART-MNLI?

alisonreboud avatar Aug 05 '21 16:08 alisonreboud

Can you please show me, or direct me to, a place where the fine-tuning is explained? I have about 1000 sentences with their labels. I want to fine-tune this model on the task. During inference a subset of the labels will be used, so zero-shot learning would be the best way to go. But when you said "pass the sentence twice, once with the correct label and once with an incorrect label while optimizing cross-entropy", I want to see how that is done using Hugging Face.

Boodhayana avatar Mar 04 '22 08:03 Boodhayana
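A sketch of the data construction described in that quote (the label ids, hypothesis template, and helper name are illustrative assumptions, not code from the blog): each labeled sentence becomes two NLI examples, one entailed hypothesis built from the true label and one contradicted hypothesis built from a randomly drawn wrong label.

```python
import random

ENTAILMENT, CONTRADICTION = 0, 2  # common 3-way NLI label ids (1 = neutral)

def to_nli_examples(sentence, true_label, all_labels, rng):
    # One positive pair (true label -> entailment) and one negative
    # pair (random wrong label -> contradiction) per sentence.
    wrong = rng.choice([l for l in all_labels if l != true_label])
    template = "This example is about {}."
    return [
        (sentence, template.format(true_label), ENTAILMENT),
        (sentence, template.format(wrong), CONTRADICTION),
    ]

rng = random.Random(0)
examples = to_nli_examples(
    "The stock market fell.", "finance", ["finance", "sports", "art"], rng
)
# two (premise, hypothesis, label) triples, ready for an NLI-style trainer
```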

As @Boodhayana said, I would also love to see the actual code that carries out the fine-tuning. I also have a dataset that I want to fine-tune the bart-mnli zero-shot model on, but I can't find any examples of how to do so.

kurah avatar Mar 08 '22 21:03 kurah

Could you please post the code you used to fine-tune bart-large-mnli on Yahoo Answers?

marouaghaouat avatar Apr 28 '22 10:04 marouaghaouat

Regrettably, I failed to save that code. If you need to fine-tune, I recommend first distilling a classifier using this script (https://github.com/huggingface/transformers/tree/main/examples/research_projects/zero-shot-distillation), and then fine-tuning the resulting model as you would any other classifier.


joeddav avatar Apr 28 '22 20:04 joeddav

@joeddav np at all. I am able to successfully fine-tune the model. Your blog and your answers in the Hugging Face forums helped me a lot. I have one concern, however. Since I am using the fine-tuned model in production, I would need it to be fast (as fast as normal text classification models). I have ~30 labels in my dataset. I am accelerating the inference time by using onnxruntime on the Hugging Face model that I fine-tuned.

The code for 'onnx'-ing is below

python -m transformers.onnx --model=facebook/bart-large-mnli --feature=sequence-classification --atol=1e-04 dir/

Even after that, the inference time for one piece of text takes almost 2 seconds (it has to iterate through 30 labels).

Are there any methods to speed up inference further?

Does distillation help? Are there any other methods I can use along with this? I want to match the inference time of normal text classification.

Boodhayana avatar Apr 29 '22 10:04 Boodhayana

@Boodhayana Distillation is exactly what you want. It will essentially train a student model, which is just a normal DistilBERT classifier, to mimic the predictions of the zero-shot teacher. You just need some example (unlabeled) data.

joeddav avatar Apr 29 '22 15:04 joeddav
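A toy sketch of the distillation objective (pure Python; the teacher distribution and student logits are made-up numbers): the student is trained to match the teacher's soft label distribution by minimizing cross-entropy against it, so a student that agrees with the teacher incurs a lower loss than one that disagrees.

```python
import math

def soft_cross_entropy(teacher_probs, student_logits):
    """Cross-entropy of student predictions against teacher soft labels."""
    m = max(student_logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in student_logits]
    total = sum(exps)
    log_probs = [math.log(e / total) for e in exps]
    return -sum(t * lp for t, lp in zip(teacher_probs, log_probs))

teacher = [0.7, 0.2, 0.1]   # zero-shot teacher's soft labels (made up)
agree = [2.0, 0.5, -0.5]    # student logits roughly agreeing with the teacher
disagree = [-0.5, 0.5, 2.0] # student logits disagreeing
loss_agree = soft_cross_entropy(teacher, agree)
loss_disagree = soft_cross_entropy(teacher, disagree)
```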

@Boodhayana can you share, or direct me to a place to understand, how the fine-tuning is actually done?

tyatabe avatar May 20 '22 08:05 tyatabe

@joeddav for distillation, what should the candidate labels be? I think it should be the candidate labels you want to use for your application, regardless of what the text you're using for distillation is about. For example, if I want to train a model to classify movie summaries into genres, I could use the AG News data to distill a zero-shot model into a smaller one, using hypothesis labels like ['thriller', 'action', 'suspense', 'horror', 'comedy'], even though the AG News data has nothing to do with that. Then I could fine-tune that distilled model with actual movie summary - genre data, right?

tyatabe avatar May 20 '22 11:05 tyatabe

Hey, thank you for getting back to me. I'm very excited to see that post! In the meantime I'm actually trying my hand with PyTorch, and I'm wondering how to encode my labels. As suggested in the zero-shot learning blog post, I'm only using the labels entailment and contradiction, but I'm unsure what the actual encodings used in the model are. From this Kaggle competition https://www.kaggle.com/competitions/contradictory-my-dear-watson I saw they're using 0, 1, or 2 (corresponding to entailment, neutral, and contradiction). Should I set up my encodings this way as well (0 for entailment and 2 for contradiction)?

Thank you,

Tada

On Sat, May 21, 2022 at 3:16 AM, Boodhayana wrote:


I plan to write a blog using a public dataset. So please wait a few days, since I am using a private dataset that I can't share outside.



tyatabe avatar May 23 '22 09:05 tyatabe