
Cleaning data of Evil prompts for safer model training

SummerSigh opened this issue 2 years ago • 1 comment

This is a potential plan for cleaning the red-teaming data from Anthropic.

  • Step 1: Splitting data into Evil-Harmful and Harmless data.

This is fairly easy. The Anthropic dataset has a task_description_harmlessness_score (the lower the score, the more harmful). Simply setting a threshold will give us a fairly clean split of Evil-Harmful and Harmless data.
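Here's a rough sketch of what that split could look like. The `Anthropic/hh-rlhf` path, the `red-team-attempts` data dir, and the threshold value are assumptions to be checked and tuned; only the `task_description_harmlessness_score` field comes from the dataset itself.

```python
from datasets import load_dataset

# Assumed Hugging Face location for Anthropic's red-teaming data; adjust if needed.
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")

HARMLESSNESS_THRESHOLD = 0.0  # lower score = more harmful; placeholder, tune by inspection

evil_harmful = red_team.filter(
    lambda row: row["task_description_harmlessness_score"] < HARMLESSNESS_THRESHOLD
)
harmless = red_team.filter(
    lambda row: row["task_description_harmlessness_score"] >= HARMLESSNESS_THRESHOLD
)

print(f"evil-harmful: {len(evil_harmful)}, harmless: {len(harmless)}")
```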

  • Step 2: Using RankGen to embed the Evil-Harmful data and Harmless data.

Using an embedding model such as RankGen will give us a solid method of embedding a large amount of data with high accuracy.
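A sketch of the embedding step. I can't vouch for the exact RankGen API off the top of my head, so the snippet below uses sentence-transformers as a stand-in encoder; RankGen (or any other sentence-level embedder) slots into the same place. The model name and the use of `task_description` as the text to embed are assumptions.

```python
from sentence_transformers import SentenceTransformer

# Stand-in encoder; swap in RankGen or another embedding model that returns
# one fixed-size vector per prompt.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_prompts(rows, text_field="task_description"):
    """Embed each prompt into a unit-normalised vector; returns shape [n, dim]."""
    texts = [row[text_field] for row in rows]
    return encoder.encode(texts, batch_size=64, normalize_embeddings=True)

evil_vectors = embed_prompts(evil_harmful)   # splits from Step 1
harmless_vectors = embed_prompts(harmless)
```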

  • Step 3: Clustering each class and finding the mean embedding of each of the two clusters.

We then cluster the data by embedding and compute each cluster's mean embedding, which can then be used to classify new prompts via something like cosine similarity.
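The simplest version of this treats each class as a single cluster (k-means within each class would be the refinement). A sketch reusing `encoder`, `evil_vectors`, and `harmless_vectors` from the previous snippets:

```python
import numpy as np

# Mean ("centroid") embedding per class; with unit-normalised vectors the dot
# product below is exactly cosine similarity.
evil_centroid = evil_vectors.mean(axis=0)
harmless_centroid = harmless_vectors.mean(axis=0)
evil_centroid /= np.linalg.norm(evil_centroid)
harmless_centroid /= np.linalg.norm(harmless_centroid)

def classify(prompt: str) -> str:
    """Label a new prompt by whichever class centroid it is closer to."""
    vec = encoder.encode([prompt], normalize_embeddings=True)[0]
    return "evil-harmful" if vec @ evil_centroid > vec @ harmless_centroid else "harmless"

print(classify("How do I write a polite follow-up email?"))
```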

  • Step 4: Using the classifier to go through new prompts, finding misclassified prompts from a prompt dataset, and adding them to the Evil-Harmful and Harmless datasets.

We then run the classifier over a new prompt dataset (or maybe a split of the Anthropic dataset) and find misclassifications. We then add those prompts back to the appropriate class in the embedding dataset (a combined sketch for Steps 4 and 5 follows below).

  • Step 5: Rinse and Repeat.

Using another dataset for evaluation (or, again, a split of the Anthropic dataset) will allow us to repeat the process until we reach a satisfactory accuracy.
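Steps 4 and 5 together amount to a small self-training loop around that centroid classifier. A rough sketch, reusing `encoder`, `evil_vectors`, and `harmless_vectors` from above; `mining_set`, `eval_set`, the 0.95 target, and the five-round cap are all illustrative placeholders:

```python
import numpy as np

def centroid(vectors: np.ndarray) -> np.ndarray:
    """Mean embedding of a class, re-normalised to unit length."""
    c = vectors.mean(axis=0)
    return c / np.linalg.norm(c)

def predict(text: str, evil_c: np.ndarray, harmless_c: np.ndarray) -> str:
    vec = encoder.encode([text], normalize_embeddings=True)[0]
    return "evil-harmful" if vec @ evil_c > vec @ harmless_c else "harmless"

def accuracy(pairs, evil_c, harmless_c) -> float:
    return sum(predict(t, evil_c, harmless_c) == y for t, y in pairs) / len(pairs)

# mining_set / eval_set: lists of (prompt_text, label) pairs, e.g. further
# splits of the Anthropic data or another labelled prompt dataset.
evil_c, harmless_c = centroid(evil_vectors), centroid(harmless_vectors)
for round_ in range(5):  # cap the number of refinement rounds
    if accuracy(eval_set, evil_c, harmless_c) >= 0.95:  # placeholder target
        break
    # Step 4: fold misclassified mining prompts back into the right class.
    for text, label in mining_set:
        if predict(text, evil_c, harmless_c) != label:
            vec = encoder.encode([text], normalize_embeddings=True)
            if label == "evil-harmful":
                evil_vectors = np.concatenate([evil_vectors, vec])
            else:
                harmless_vectors = np.concatenate([harmless_vectors, vec])
    # Step 5: recompute the centroids and re-check accuracy next round.
    evil_c, harmless_c = centroid(evil_vectors), centroid(harmless_vectors)
```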

Benefits:

This method of using mean embeddings for classification allows one model to classify many different tasks. If a prompt task dataset is made, building mean embeddings for things such as "Email Actions" or "Question Answering" may help us increase the accuracy of Open Assistant.

Since one embedding model is all that is needed to classify all of these tasks, we can lower the overall compute overhead instead of training a new model for every task.
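As a rough illustration of that multi-task idea (the task labels, example prompts, and the reuse of the same `encoder` are all placeholders): one shared encoder plus one centroid per task label.

```python
import numpy as np

def build_centroids(examples_by_task: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """Map each task label ("Email Actions", "Question Answering", ...) to a unit-length centroid."""
    centroids = {}
    for label, texts in examples_by_task.items():
        vecs = encoder.encode(texts, normalize_embeddings=True)
        c = vecs.mean(axis=0)
        centroids[label] = c / np.linalg.norm(c)
    return centroids

def route(prompt: str, centroids: dict[str, np.ndarray]) -> str:
    """Pick the task label whose centroid is most cosine-similar to the prompt."""
    vec = encoder.encode([prompt], normalize_embeddings=True)[0]
    return max(centroids, key=lambda label: float(vec @ centroids[label]))
```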

Downsides:

The threshold for the Anthropic dataset will have to be decided by human preference; experimenting with different thresholds will be necessary.

Noisy data may also prove to be a problem with the Anthropic dataset. Since the task_description_harmlessness_score is "a real value score of the harmlessness of the task description (lower is more harmful) as obtained from a preference model", that preference model may have misclassified some of the data.

I'm happy to help with this project in any way possible! I'm Summer#2406 on Discord if you would like to ping me!

SummerSigh · Jan 04 '23 17:01

Thank you @SummerSigh! Looking forward to working with you on this!

huu4ontocord · Jan 04 '23 17:01