Open-Assistant
Data quality filter for augmented data
We are gathering and creating a lot of augmented data in the form of instruction->answer pairs, and we want to be able to separate the good pairs from the bad ones. One approach is to create labeled examples and train a classifier. Other approaches include running embedding models such as RankGen to filter, or applying perplexity and toxicity filters. We can also cluster and remove outliers (possibly outliers among answers to similar questions). We need a prototype pipeline for doing this to help filter our data into a higher-quality dataset.
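For example, the perplexity filter could be a thin wrapper around a causal LM. A minimal sketch, assuming HF transformers with GPT-2 as a placeholder model (the cutoff and field names are made up and would need tuning against real data):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM gives a rough perplexity signal.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss = mean cross-entropy
    return torch.exp(out.loss).item()

# Hypothetical cutoff; to be tuned on the actual data.
def keep(pair: dict, max_ppl: float = 100.0) -> bool:
    return perplexity(pair["instruct"] + "\n" + pair["answer"]) < max_ppl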
To extend further, the pipeline can be applied to each dataset individually and to all datasets as an aggregate.
Functionalities:
Score:
- leverage multiple reward models (our own, others on HF, with different specialties)
- leverage multiple safety models
- give a diversity score against the rest of the training data
- give a perplexity score (somehow indicating that it's not similar to the training data)
- language detection
Removal:
- dedup (see the sketch after this list)
- identify PII
- identify gibberish
- etc...
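For the dedup step, a minimal sketch using exact matching on a normalized (instruct, answer) key; the field names are placeholders, and near-duplicate detection (e.g. MinHash) would catch more:

import hashlib
from typing import Dict, List

def dedup(pairs: List[Dict]) -> List[Dict]:
    # Keep the first occurrence of each normalized instruction/answer pair.
    seen, kept = set(), []
    for p in pairs:
        key = hashlib.md5(
            (p["instruct"].strip().lower() + "\x00" + p["answer"].strip().lower()).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept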
Looks really good! Thank you. Looking forward to your results!
I have thought more about it, started writing an interface, and put together a quick implementation. Please let me know if you have any comments. I am going to propose a config class soon, as it will be important to make the whole pipeline reproducible and configurable.
Proposed Pipeline:
Config -> ScorerPipeline -> FilterPipeline (Removal) -> ClusteringPipeline -> Human QC/ label -> Some proxy data quality model training -> Good data -> Next Iteration
Some principles (WIP)
- extensible to different scorers/clustering in whatever framework/models (so everyone can contribute)
- reproducible and traceable via configuration management (which scorer, filter, clustering config; see the config sketch after this list)
- part of the experiment artifact, as we will have a lot of iterations
- for meta-analysis (e.g. does diversity result in a better model? do more reward models result in a better model?)
- as such, will also need data versioning too
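As a first cut, the config could be a simple dataclass; the field names below are illustrative, not final:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PipelineConfig:
    dataset_name: str
    dataset_version: str                               # data versioning / traceability
    scorers: List[Dict] = field(default_factory=list)  # e.g. {"cls": "RewardModelScorer", "model_id": "..."}
    filters: List[Dict] = field(default_factory=list)  # e.g. {"score_type": "toxicity", "max": 0.5}
    clustering: Dict = field(default_factory=dict)     # algorithm + hyperparameters
    seed: int = 42                                     # reproducibility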
Abstract class:
from abc import ABC, abstractmethod
from typing import List

class ScorerBase(ABC):
    score_type = "base"

    @abstractmethod
    def score(self, instructans_list: List[dict]):
        pass
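And a rough sketch of what ScorerPipeline and a trivial concrete scorer could look like (a sketch of the intended interface, not the final implementation):

class LengthScorer(ScorerBase):
    score_type = "length"

    def score(self, instructans_list: List[dict]):
        # Character length of the answer; no model involved.
        return [{"model_id": None, "score": len(x["answer"]), "score_type": self.score_type}
                for x in instructans_list]

class ScorerPipeline:
    def __init__(self):
        self.scorers = []

    def add(self, scorers):
        self.scorers.extend(scorers)

    def score(self, instructans_list):
        # Attach the list of per-scorer results to each instruction/answer pair.
        results = [dict(x, score=[]) for x in instructans_list]
        for scorer in self.scorers:
            for record, s in zip(results, scorer.score(instructans_list)):
                record["score"].append(s)
        return results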
Example:
scorer_pipeline = ScorerPipeline()
scorer_pipeline.add([
    RewardModelScorer("OpenAssistant/reward-model-deberta-v3-large"),
    PerplexityScorer("gpt2"),
    ToxicityScorer("unitary/toxic-bert"),
    GibberishScorer("madhurjindal/autonlp-Gibberish-Detector-492513457"),
    LengthScorer(),
])
result = scorer_pipeline.score(data_list)
print(result[0])
{'instruct': 'Can you suggest how to handle long layovers in foreign countries briefly?',
'answer': "Do your research. Learn a few phrases of the official language. Get some cash. Do some quick study before landing. Find a restroom away from the crowd. Know where you need to board your next flight. Keep your belongings secure, especially if you're traveling through a crowded airport.",
'score': [{'model_id': 'OpenAssistant/reward-model-deberta-v3-large',
'score': 0.6610293388366699,
'score_type': 'reward'},
{'model_id': 'gpt2',
'score': 44.386924743652344,
'score_type': 'perplexity'},
{'model_id': 'unitary/toxic-bert',
'score': [{'label': 'toxic', 'score': 0.0008366688853129745},
{'label': 'insult', 'score': 0.00018980714958161116},
{'label': 'obscene', 'score': 0.00015945330960676074},
{'label': 'identity_hate', 'score': 0.00015700122457928956},
{'label': 'threat', 'score': 0.0001366952055832371},
{'label': 'severe_toxic', 'score': 0.00011524109868332744}],
'score_type': 'toxicity'},
{'model_id': 'madhurjindal/autonlp-Gibberish-Detector-492513457',
'score': 0.029495954513549805,
'score_type': 'gibberish'},
{'model_id': None, 'score': 359, 'score_type': 'length'}]}
Oh - this looks really good :) maybe also add a basic length cutoff at the beginning too?
@kenhktsui a few questions here. For FilterPipeline (Removal), is this only removing based on what you put in the removal functionality (dedup, PII, etc.)? We also want to remove based on scores, right (e.g. if toxicity is above a certain threshold)? I'm also interested in collaborating with you on this if it would be helpful!
@pruksmhc Thanks for your question! Yes, the FilterPipeline can filter on everything that we score in the ScorerPipeline, based on both absolute and relative statistics, so it's flexible enough to cover most things and is extensible. I have finished designing the interface and am setting up config management now; I expect to release the first version in the coming few days. At the same time, I am incorporating a lot of useful code from @ontocord. We could collaborate in that area 😃
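To illustrate the idea, a hypothetical sketch of score-based filtering (the class shape and rule format are placeholders, and the thresholds are made up):

class FilterPipeline:
    def __init__(self, rules):
        # rules: list of (score_type, predicate); a pair is kept only if
        # every predicate holds on the matching score entries.
        self.rules = rules

    def filter(self, scored_list):
        kept = []
        for record in scored_list:
            ok = all(
                predicate(s["score"])
                for score_type, predicate in self.rules
                for s in record["score"]
                if s["score_type"] == score_type
            )
            if ok:
                kept.append(record)
        return kept

# Placeholder thresholds, to be tuned:
filter_pipeline = FilterPipeline([
    ("reward", lambda v: v > 0.5),
    ("perplexity", lambda v: v < 100.0),
    ("gibberish", lambda v: v < 0.5),
])
filtered = filter_pipeline.filter(result)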
Hey hey - thank you for the great work. Can you do me a favor and provide a link to my code when copying or deriving any code from my repo, and include this copyright notice somewhere to indicate where it comes from? Thanks. I should eventually just publish my riverbed code on PyPI, but that has to wait until we get OA launched :) thank you!
# coding=utf-8
# Copyright 2021-2023, Ontocord, LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Closing old data issue.