LLM Generalizable Safety Checker
For now I'm writing here primarily to document my plan for an MVP. As suggested in the Discord server, I plan to make a generalizable safety checker. This could be done either as a prompt or as a filter using a contrastive model such as CARP. Something like an NSFW filter shouldn't be too hard given a good prompt. If a more reliable method is needed, a synthetic dataset could be used to produce a soft prompt or another token-based tuning deliverable. The main problem is that Open-Assistant would need some way to interface with KoboldAI, which might make for a good base and partnership to build on. I'm still working on an MVP at the moment, and I will leave it in the reply to this issue.
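As a first pass at the filter route, here is a minimal sketch using a zero-shot NLI classifier from Hugging Face as a stand-in for a contrastive scorer like CARP. The model choice, label set, and threshold are placeholder assumptions, not anything settled for Open-Assistant.

```python
# Minimal sketch of a filter-style safety check using a zero-shot NLI model
# as a stand-in for a contrastive scorer such as CARP.
# Model, labels, and threshold below are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def is_unsafe(text: str, threshold: float = 0.7) -> bool:
    labels = ["safe", "nsfw", "violent", "self-harm"]  # illustrative label set
    result = classifier(text, candidate_labels=labels)
    top_label, top_score = result["labels"][0], result["scores"][0]
    # flag the text only if the top label is an unsafe category with high confidence
    return top_label != "safe" and top_score >= threshold

print(is_unsafe("This is a harmless sentence about gardening."))
```

A tuned soft prompt or a CARP-style model could later be swapped in behind the same `is_unsafe` interface.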
I'll assign you for now. Looking forward to a more detailed view on this. Also, it's quite unlikely that we'll start to depend heavily on some third-party API.
I'm sure there's a way to eliminate that requirement; I'll see if I can make a custom script to load the soft prompt.
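Since a soft prompt is just a matrix of learned embeddings prepended to the input embeddings, a custom loader could look roughly like the sketch below. The model name, file format, and tensor shapes are assumptions; KoboldAI's own softprompt archive format would need its own parsing, and `generate` with `inputs_embeds` requires a reasonably recent transformers release.

```python
# Rough sketch of loading a trained soft prompt without KoboldAI.
# Assumes a tensor of shape (n_virtual_tokens, hidden_size) saved with torch.save;
# the model name and file path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

soft_prompt = torch.load("safety_softprompt.pt")  # assumed (n_virtual, hidden) tensor

def generate_with_soft_prompt(text: str, max_new_tokens: int = 50) -> str:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(input_ids)
    # prepend the learned virtual tokens to the real token embeddings
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    output_ids = model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```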
Hey, there are some of us working on a similar issue over here: https://github.com/LAION-AI/Open-Assistant/issues/416 if you want to pop in!
@puffy310 - if we can gather data, we can train our model to respond appropriately given the warning instructions. Then we wouldn't need to use KoboldAI, assuming they permit us to use the data for that purpose. And of course we should credit them for the use of this tool.
Tumblr apparently uses a lot of trigger warnings. This article has an incomplete list of tags: https://trigger-warnings.tumblr.com/tags
It's technically for movies, but the same tags appear to be used on the main site.
Actually, here's an even better dataset: take all Tumblr posts tagged as having a content warning (warning: this link has "weird stuff", to say the least): https://www.tumblr.com/search/cw
Then the model would predict whether a given tag goes with a given post (almost all of the cw posts carry other tags describing the type of content warning). A rough sketch of how such a dataset could be assembled is below.
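Here's a toy sketch of turning scraped cw-tagged posts into tag-matching pairs, with positives from a post's own tags and negatives sampled from tags on other posts. The input format and field names are assumptions about whatever the scraper would produce.

```python
# Toy sketch: build (post, tag, label) pairs for a tag-matching classifier.
# Input format is assumed: a list of {"text": str, "tags": [str, ...]} dicts.
import random

def build_pairs(posts, num_negatives=1, seed=0):
    rng = random.Random(seed)
    all_tags = sorted({t for p in posts for t in p["tags"]})
    pairs = []
    for post in posts:
        # positive pairs: the post's own content-warning tags
        for tag in post["tags"]:
            pairs.append({"text": post["text"], "tag": tag, "label": 1})
        # negative pairs: tags sampled from the rest of the corpus
        negatives = [t for t in all_tags if t not in post["tags"]]
        for tag in rng.sample(negatives, min(num_negatives, len(negatives))):
            pairs.append({"text": post["text"], "tag": tag, "label": 0})
    return pairs

example = [
    {"text": "a post about something upsetting", "tags": ["cw blood"]},
    {"text": "a cheerful post about dinner", "tags": ["cw food"]},
]
print(build_pairs(example))
```

The resulting pairs could feed either a binary classifier or a contrastive objective over (post, tag) embeddings.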
@ChristopherKing42 and @puffy310 - can you both work on this together? Very excited to see your results.
I think this has been superseded by more recent safety work, including blade2blade.