Open-Assistant
Data quality filter for augmented data
We are gathering and creating a lot of augmented data in the form of instruction->answer pairs, and we want to be able to separate the good pairs from the bad ones. One approach is to create labeled examples and train a classifier. Other approaches include running embedding models such as RankGen to filter, or applying perplexity and toxicity filters. We can also cluster and remove outliers (possibly outliers among answers to similar questions). We need a prototype pipeline for doing this to help filter our data into a higher-quality dataset.
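For example, the perplexity filter could be a thin wrapper around a causal LM. A minimal sketch, assuming HF transformers with GPT-2 as a placeholder model (the cutoff and field names are made up and would need tuning against real data):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM gives a rough perplexity signal.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean cross-entropy
    return torch.exp(out.loss).item()

# Hypothetical cutoff; to be tuned on the actual data.
def keep(pair: dict, max_ppl: float = 100.0) -> bool:
    return perplexity(pair["instruct"] + "\n" + pair["answer"]) < max_ppl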
To extend further, the pipeline can be applied to each dataset individually and to all datasets as an aggregate.
Functionalities:
Score:
- leverage multiple reward models (our own, others on HF, with different specialties)
- leverage multiple safety models
- give a diversity score against the rest of the training data
- give a perplexity score (somehow indicating that it's not similar to the training data)
- language detection
Removal:
- dedup (see the sketch after this list)
- identify PII
- identify gibberish
- etc...
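For the dedup step, a minimal sketch using exact matching on a normalized (instruct, answer) key; the field names are placeholders, and near-duplicate detection (e.g. MinHash) would catch more:

import hashlib
from typing import Dict, List

def dedup(pairs: List[Dict]) -> List[Dict]:
    # Keep the first occurrence of each normalized instruction/answer pair.
    seen, kept = set(), []
    for p in pairs:
        key = hashlib.md5(
            (p["instruct"].strip().lower() + "\x00" + p["answer"].strip().lower()).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept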
Looks really good! Thank you. Looking forward to your results!
I have thought more about it, started writing an interface, and put together a quick implementation. Please let me know if you have any comments. I am going to propose a config class soon, as it will be important to make the whole pipeline reproducible and configurable.
Proposed Pipeline:
Config -> ScorerPipeline -> FilterPipeline (Removal) -> ClusteringPipeline -> Human QC/ label -> Some proxy data quality model training -> Good data -> Next Iteration
Some principles (WIP)
- extensible to different scorers/clustering in whatever framework/models (so everyone can contribute)
- reproducible and traceable via configuration management (which scorer, filter, clustering config; see the config sketch after this list)
- part of the experiment artifact, as we will have a lot of iterations
- for meta-analysis (e.g. does diversity result in a better model? do more reward models result in a better model?)
- as such, will also need data versioning too
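As a first cut, the config could be a simple dataclass; the field names below are illustrative, not final:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PipelineConfig:
    dataset_name: str
    dataset_version: str                               # data versioning / traceability
    scorers: List[Dict] = field(default_factory=list)  # e.g. {"cls": "RewardModelScorer", "model_id": "..."}
    filters: List[Dict] = field(default_factory=list)  # e.g. {"score_type": "toxicity", "max": 0.5}
    clustering: Dict = field(default_factory=dict)     # algorithm + hyperparameters
    seed: int = 42                                     # reproducibility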
Abstract class:
from abc import ABC, abstractmethod
from typing import List

class ScorerBase(ABC):
    score_type = "base"

    @abstractmethod
    def score(self, instructans_list: List[dict]):
        pass
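And a rough sketch of what ScorerPipeline and a trivial concrete scorer could look like (a sketch of the intended interface, not the final implementation):

class LengthScorer(ScorerBase):
    score_type = "length"

    def score(self, instructans_list: List[dict]):
        # Character length of the answer; no model involved.
        return [{"model_id": None, "score": len(x["answer"]), "score_type": self.score_type}
                for x in instructans_list]

class ScorerPipeline:
    def __init__(self):
        self.scorers = []

    def add(self, scorers):
        self.scorers.extend(scorers)

    def score(self, instructans_list):
        # Attach the list of per-scorer results to each instruction/answer pair.
        results = [dict(x, score=[]) for x in instructans_list]
        for scorer in self.scorers:
            for record, s in zip(results, scorer.score(instructans_list)):
                record["score"].append(s)
        return results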
Example:
scorer_pipeline = ScorerPipeline()
scorer_pipeline.add([
    RewardModelScorer("OpenAssistant/reward-model-deberta-v3-large"),
    PerplexityScorer("gpt2"),
    ToxicityScorer("unitary/toxic-bert"),
    GibberishScorer("madhurjindal/autonlp-Gibberish-Detector-492513457"),
    LengthScorer(),
])
result = scorer_pipeline.score(data_list)
print(result[0])
{'instruct': 'Can you suggest how to handle long layovers in foreign countries briefly?',
'answer': "Do your research. Learn a few phrases of the official language. Get some cash. Do some quick study before landing. Find a restroom away from the crowd. Know where you need to board your next flight. Keep your belongings secure, especially if you're traveling through a crowded airport.",
'score': [{'model_id': 'OpenAssistant/reward-model-deberta-v3-large',
'score': 0.6610293388366699,
'score_type': 'reward'},
{'model_id': 'gpt2',
'score': 44.386924743652344,
'score_type': 'perplexity'},
{'model_id': 'unitary/toxic-bert',
'score': [{'label': 'toxic', 'score': 0.0008366688853129745},
{'label': 'insult', 'score': 0.00018980714958161116},
{'label': 'obscene', 'score': 0.00015945330960676074},
{'label': 'identity_hate', 'score': 0.00015700122457928956},
{'label': 'threat', 'score': 0.0001366952055832371},
{'label': 'severe_toxic', 'score': 0.00011524109868332744}],
'score_type': 'toxicity'},
{'model_id': 'madhurjindal/autonlp-Gibberish-Detector-492513457',
'score': 0.029495954513549805,
'score_type': 'gibberish'},
{'model_id': None, 'score': 359, 'score_type': 'length'}]}
Oh - this looks really good :) maybe also add a basic length cutoff at the beginning too?
@kenhktsui a few questions here. For FilterPipeline (Removal), is this only removing based on what you put in the removal functionality (dedup, PII, etc.)? We also want to remove based on scores, right (e.g. if toxicity is above a certain threshold)? I'm also interested in collaborating with you on this if it would be helpful!
@pruksmhc Thanks for your question! Yes, the FilterPipeline can filter on everything that we score in the ScorerPipeline, based on both absolute and relative statistics, so it's flexible enough to cover most things and is extensible. I have finished designing the interface and am setting up config management now; I expect to release the first version in the coming few days. At the same time, I am incorporating a lot of useful code from @ontocord. We could collaborate in that area 😃
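To illustrate the idea, a hypothetical sketch of score-based filtering (the class shape and rule format are placeholders, and the thresholds are made up):

class FilterPipeline:
    def __init__(self, rules):
        # rules: list of (score_type, predicate); a pair is kept only if
        # every predicate holds on the matching score entries.
        self.rules = rules

    def filter(self, scored_list):
        kept = []
        for record in scored_list:
            ok = all(
                predicate(s["score"])
                for score_type, predicate in self.rules
                for s in record["score"]
                if s["score_type"] == score_type
            )
            if ok:
                kept.append(record)
        return kept

# Placeholder thresholds, to be tuned:
filter_pipeline = FilterPipeline([
    ("reward", lambda v: v > 0.5),
    ("perplexity", lambda v: v < 100.0),
    ("gibberish", lambda v: v < 0.5),
])
filtered = filter_pipeline.filter(result)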
Hey hey - thank you for the great work. Can you do me a favor and provide a link to my code when copying or deriving any code from my repo, and include this copyright notice somewhere to indicate where it comes from? Thanks. I should eventually just publish my riverbed code on PyPI, but that has to wait until we get OA launched :) thank you!
# coding=utf-8
# Copyright 2021-2023, Ontocord, LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Closing old data issue.