
LLM Generalizable Safety Checker

Open puffy310 opened this issue 2 years ago • 4 comments

For now I'm writing here primarily to document my plan for an MVP. As suggested in the Discord server, I plan to make a generalizable safety checker. This could be done either as a prompt or as a filter using a contrastive model such as CARP. Something such as an NSFW filter shouldn't be too hard given a good prompt. If a more reliable method is needed, a synthetic dataset could be made to produce a soft prompt or another token-based tuning deliverable. The main problem is that Open-Assistant would need some way to interface with KoboldAI, which might be a good base and partnership to build on. I'm still working on an MVP at the moment, and I will post it in a reply to this issue.
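
As a rough illustration of the prompt-based variant, here is a minimal sketch. The template wording, the function names `build_safety_prompt` and `parse_verdict`, and the yes/no answer format are all assumptions made for this example; the model call itself is left out because no backend has been chosen.

```python
# Hypothetical sketch of a prompt-based safety check.
# The template text and both function names are illustrative assumptions,
# not part of Open-Assistant, KoboldAI, or CARP.

def build_safety_prompt(text, categories=("NSFW",)):
    """Wrap user text in a yes/no classification prompt."""
    category_list = ", ".join(categories)
    return (
        "You are a content safety checker.\n"
        f"Categories to check: {category_list}\n"
        "Text:\n"
        f'"""{text}"""\n'
        "Does the text fall into any of the categories? Answer Yes or No.\n"
        "Answer:"
    )


def parse_verdict(completion):
    """Return True if the model's completion starts with 'yes'."""
    return completion.strip().lower().startswith("yes")
```

The prompt string would be sent to whichever language model is available, and `parse_verdict` applied to the text it returns. Adding a category is just a matter of extending the `categories` tuple, which is what would make the checker generalizable.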

puffy310 avatar Jan 02 '23 21:01 puffy310

I'll assign you for now. Looking forward to a more detailed view on this. Also, quite unlikely that we'll start to heavily depend on some third-party API.

yk avatar Jan 02 '23 22:01 yk

I'm sure that there's a way to eliminate the requirement, I'll see if I can make a custom script to load the soft prompt.

puffy310 avatar Jan 02 '23 22:01 puffy310

Hey, there are some of us working on a similar issue over here: https://github.com/LAION-AI/Open-Assistant/issues/416 if you want to pop in!

smytjf11 avatar Jan 06 '23 18:01 smytjf11

@puffy310 - if we can gather data, we can train our model to respond appropriately given the warning instructions. Then we wouldn't need to use KoboldAI, assuming they permit us to use the data for that purpose. Also, of course, we should credit them for the use of this tool.

huu4ontocord avatar Jan 10 '23 05:01 huu4ontocord

Tumblr apparently uses a lot of trigger warnings. This article has an incomplete list of tags: https://trigger-warnings.tumblr.com/tags

It's technically for movies, but it appears that they are also tags on the main site.

ChristopherKing42 avatar Jan 12 '23 15:01 ChristopherKing42

Actually, here's an even better dataset. Take all Tumblr posts tagged as having a content warning (warning: this link has "weird stuff", to say the least): https://www.tumblr.com/search/cw

Then the model would predict if a given tag goes with a given post (almost all the cw posts have other tags describing the type of content warning).
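
To make the tag-prediction idea concrete, here is a minimal sketch of how (post, tag, label) training pairs could be built. It assumes the posts have already been collected as `(text, tags)` tuples; the function name `build_pairs`, the `marker_tags` parameter, and the negative-sampling scheme are assumptions for illustration, and no scraping or model training is shown.

```python
import random

# Hypothetical sketch: turn tagged posts into binary (text, tag, label) pairs.
# Label 1 means the tag was on the post; label 0 is a randomly sampled tag
# that was not on the post (a negative example).

def build_pairs(posts, negatives_per_positive=1, marker_tags=("cw",), seed=0):
    rng = random.Random(seed)
    markers = set(marker_tags)
    # Vocabulary of descriptive tags, excluding the generic marker tag itself.
    all_tags = sorted({t for _, tags in posts for t in tags} - markers)
    pairs = []
    for text, tags in posts:
        tag_set = set(tags) - markers
        candidates = [t for t in all_tags if t not in tag_set]
        for tag in sorted(tag_set):
            pairs.append((text, tag, 1))
            k = min(negatives_per_positive, len(candidates))
            for negative in rng.sample(candidates, k):
                pairs.append((text, negative, 0))
    return pairs
```

A classifier trained on these pairs would then score any (post, tag) combination, so a new warning category could be checked simply by supplying its tag.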

ChristopherKing42 avatar Jan 12 '23 16:01 ChristopherKing42

@ChristopherKing42 and @puffy310 - can you both work on this together? Very excited for your results.

huu4ontocord avatar Jan 12 '23 17:01 huu4ontocord

I think this has been superseded by more recent safety work incl. blade2blade

olliestanley avatar Apr 29 '23 21:04 olliestanley