Open-Assistant

Create an instruction-detector

Open yk opened this issue 2 years ago • 20 comments

There is a lot of conversational data on the web (for example Twitter, Reddit, etc.), yet only a tiny fraction of it starts with some sort of instruction or request for a task to be fulfilled. We need a system, either a model or a heuristic (or a combination), to classify text as "instruction-like", which would allow us to harvest data from a wide variety of places.
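
As a rough illustration of the heuristic option, here is a minimal sketch that flags text whose opening sentence looks like a request or command. The verb list, the regex patterns, and the function name `looks_like_instruction` are all assumptions made up for this example; nothing here comes from an actual implementation in the repo, and a real detector would need far broader coverage.

```python
import re

# Hypothetical cue lists, chosen for illustration only.
IMPERATIVE_VERBS = {
    "write", "explain", "summarize", "translate", "list", "describe",
    "create", "generate", "give", "tell", "show", "find", "make",
}
REQUEST_PATTERNS = [
    re.compile(r"^(can|could|would|will) (you|someone|anyone)\b"),
    re.compile(r"^(please|pls)\b"),
    re.compile(r"^how (do|can|would|should) (i|we|you)\b"),
    re.compile(r"^(what|why|when|where|who|which)\b.*\?$"),
]


def looks_like_instruction(text: str) -> bool:
    """Return True if the opening sentence reads like a request or command."""
    # Only the first sentence is inspected, since the goal is to find
    # conversations that *start* with an instruction.
    first = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0].lower()
    if not first:
        return False
    if any(pattern.search(first) for pattern in REQUEST_PATTERNS):
        return True
    # Fall back to checking whether the sentence opens with an imperative verb.
    words = re.findall(r"[a-z']+", first)
    return bool(words) and words[0] in IMPERATIVE_VERBS
```

A heuristic like this is cheap enough to run over large dumps as a first-pass filter, with a model applied afterwards to the surviving candidates.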

yk avatar Dec 29 '22 12:12 yk

Hi @yk, let's do it. From the discussion on #126 I'll be taking this task and proposing a step-by-step solution either tonight or tomorrow.

dhruv2601 avatar Dec 29 '22 15:12 dhruv2601

Hey, hi. Can I also collaborate?

rohanpatankar926 avatar Dec 29 '22 19:12 rohanpatankar926

Hi @rohanpatankar926 sure. I'm not sure what the best way to go is for that. A suggestion: we let @dhruv2601 come up with an initial implementation, and then iterate on that? @dhruv2601 might then also switch back to #126 while you can improve the detector.

yk avatar Dec 29 '22 20:12 yk

Hi, if possible I would like to help with this and/or #126

MattiaSangermano avatar Dec 29 '22 20:12 MattiaSangermano

Hi @MattiaSangermano thanks for the interest. See my suggestion above, would that work for you?

yk avatar Dec 29 '22 21:12 yk

Yes, at this point I think it's the best way to proceed

MattiaSangermano avatar Dec 29 '22 22:12 MattiaSangermano

@yk hi, I'd also like to contribute. I read the discussion above and think the plan makes a lot of sense.

totuta avatar Dec 29 '22 23:12 totuta

Could zero-shot classification be a solution? "facebook/bart-large-mnli" on HF gives a >0.7 score for @yk's initial post being a request :)
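
For anyone who wants to try this, a minimal sketch using the Hugging Face `transformers` zero-shot pipeline might look like the following. The label set `["request", "statement"]`, the helper names, and the use of 0.7 as a cutoff are assumptions for illustration (the 0.7 only echoes the score mentioned above, it is not a tuned threshold). The classifier is passed in as an argument so the scoring logic can be exercised without downloading the model.

```python
from typing import Callable

# Assumed candidate labels; other label wordings may give different scores.
LABELS = ["request", "statement"]


def load_zero_shot_classifier():
    """Build the zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def request_score(text: str, classifier: Callable) -> float:
    """Return the score the classifier assigns to the 'request' label."""
    result = classifier(text, candidate_labels=LABELS)
    # The pipeline returns parallel 'labels' and 'scores' lists.
    return dict(zip(result["labels"], result["scores"]))["request"]


def is_instruction(text: str, classifier: Callable, threshold: float = 0.7) -> bool:
    """Flag text whose 'request' score exceeds an assumed threshold."""
    return request_score(text, classifier) > threshold
```

Usage would be something like `clf = load_zero_shot_classifier()` followed by `is_instruction(some_text, clf)`. Scores will vary with the label wording, so the threshold would need checking against labeled examples.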

agoryuno avatar Jan 01 '23 13:01 agoryuno

Yes, it's probably viable to build an ensemble of things like this. It depends on how far one can get the noise down.
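
One simple way to sketch such an ensemble is a weighted average over several scorers, each mapping text to a value in [0, 1]. Everything below is hypothetical: the `Scorer` type, the two toy scorers, and the weighting scheme are placeholders, and a model-based scorer could be dropped into the same list.

```python
from typing import Callable, Optional, Sequence

# A scorer maps text to a confidence in [0, 1] that it is instruction-like.
Scorer = Callable[[str], float]


def ensemble_score(
    text: str,
    scorers: Sequence[Scorer],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of the individual scorer outputs."""
    if not scorers:
        raise ValueError("at least one scorer is required")
    if weights is None:
        weights = [1.0] * len(scorers)
    if len(weights) != len(scorers):
        raise ValueError("weights and scorers must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s(text) for w, s in zip(weights, scorers)) / total


# Two toy scorers, for illustration only.
def starts_with_please(text: str) -> float:
    return 1.0 if text.strip().lower().startswith("please") else 0.0


def ends_with_question_mark(text: str) -> float:
    return 1.0 if text.strip().endswith("?") else 0.0
```

Raising the acceptance threshold on the combined score keeps fewer but cleaner examples, which is one knob for getting the noise down.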

yk avatar Jan 01 '23 19:01 yk

Hi @dhruv2601, I have written scripts based on #126 to process tweets into conversation threads. If any model has been trained to detect useful instructions, we could then run it on that file to filter it. If you need the file, I can send it to you via Discord. I will also update my fork of the repo soon with the code to do all the processing, in case anyone wants to download dumps and try it from their side.

Jmete avatar Jan 09 '23 18:01 Jmete

@dhruv2601 any updates on this?

yk avatar Jan 10 '23 11:01 yk

Hey all and @yk, I've trained a model for this task and it works well. Currently, I am testing the model on data other than the validation set, i.e. on as many instruction styles as possible, and I'm using GPT-JT and ChatGPT to help with this. It is an iterative process: when I discover new instruction styles, I add them to the training data and repeat.

The action item currently is to prepare a final model, upload it to HF and create a model card and data collection process. Hopefully, I'll update again in a couple of days.

dhruv2601 avatar Jan 10 '23 11:01 dhruv2601

@dhruv2601 thanks a lot for the update. Could you check in the code for this somewhere in the repo, e.g. under /model/instruction_detector/, and come to our Discord (ping me there) to give regular updates on it? The issue is that we need to know very accurately what's in the model, both in terms of code and data, in order to use it.

yk avatar Jan 10 '23 12:01 yk

@dhruv2601 Did you use just the Twitter data for the model you trained, or did you use additional datasets?

lakshaykc avatar Jan 11 '23 00:01 lakshaykc

Hi, due to work and school deadlines, I have been a bit delayed in updating this task. I plan to be more active in a couple of days.

dhruv2601 avatar Jan 12 '23 17:01 dhruv2601

@dhruv2601 I would love to test your model. Please ping in the data channel on Discord.

huu4ontocord avatar Jan 14 '23 08:01 huu4ontocord

@dhruv2601 checking on this again.

huu4ontocord avatar Jan 22 '23 04:01 huu4ontocord

Hello @dhruv2601, is there any update on the instruction detector model?

Jmete avatar Jan 28 '23 16:01 Jmete

@yk, it seems this task would benefit from being accelerated. Though @dhruv2601 is already on this, may I spend some time on a minimal viable example?

totuta avatar Feb 15 '23 08:02 totuta

This issue has stalled and I'm not sure how relevant it still is, so I'm removing it from the project board for now.

andreaskoepf avatar May 05 '23 10:05 andreaskoepf