Open-Assistant OA Developer Meeting

Last meeting #3321

spam, bots, and data quality for inference and RLHF
- found this old issue #914

Jun 16 '23 06:06 AbdBarho

High Priority:

System prompt prefix and initial prompt categorization tasks: these should include language, task categorization, and other tags. An example: <|system|> lang:en, task:coding, tag:python <|prompter|> ...
Review system design to clean up existing data: should include an edit proposal and annotation system. In the works: https://github.com/LAION-AI/Open-Assistant/pull/3289
Pause English data collection once the review system is implemented, so that review runs against a static, non-moving target; recent English data contributions contain too much spam. Release the data at the pause point as Oasst1.1.
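
A rough sketch of how the system prompt prefix from the first item could be built and read back. The field names lang, task, and tag and the <|system|> / <|prompter|> tokens come from the example above; the function names and the exact separator handling are assumptions for illustration, not an agreed format.

```python
# Hypothetical helpers for the proposed "<|system|> lang:en, task:coding, tag:python" prefix.
# Function names and parsing rules are illustrative assumptions, not project code.

SYSTEM_TOKEN = "<|system|>"
PROMPTER_TOKEN = "<|prompter|>"


def build_system_prefix(lang, task, tags=()):
    """Return a prefix such as '<|system|> lang:en, task:coding, tag:python'."""
    fields = [f"lang:{lang}", f"task:{task}"]
    fields += [f"tag:{tag}" for tag in tags]
    return f"{SYSTEM_TOKEN} " + ", ".join(fields)


def format_prompt(lang, task, tags, user_message):
    """Prepend the system prefix to a prompter message."""
    return f"{build_system_prefix(lang, task, tags)} {PROMPTER_TOKEN} {user_message}"


def parse_system_prefix(text):
    """Extract lang, task, and tags from text that starts with a system prefix."""
    # Keep only the part between the system token and the prompter token (if any).
    body = text.split(SYSTEM_TOKEN, 1)[1].split(PROMPTER_TOKEN, 1)[0]
    parsed = {"lang": None, "task": None, "tags": []}
    for field in body.split(","):
        key, _, value = field.strip().partition(":")
        if key == "tag":
            parsed["tags"].append(value)
        elif key in ("lang", "task"):
            parsed[key] = value
    return parsed


if __name__ == "__main__":
    prompt = format_prompt("en", "coding", ["python"], "How do I reverse a list?")
    print(prompt)
    print(parse_system_prefix(prompt))
```

Keeping the fields as plain key:value pairs means the categorization task only has to emit a few short labels, and the same parser can be reused to filter or route messages by language or task.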

Medium Priority:

A clear, consistent labeling guideline, as the previous RLHF results weren't ideal. Proposal for review: https://github.com/LAION-AI/Open-Assistant/issues/2893; a "potentially synthetic" tag could be added as well.
Design a regular dataset release cadence for the future. Maybe every two weeks?
Liberapay/Open Collective setup for funding.

Low Priority:

Dataset language localization: conversion between zh-hant and zh-hans should be easy, as there are no grammar differences and non-LLM libraries can already do it efficiently.
Set up a Lemmy instance for "Ask Open Assistant" as an alternative to Reddit to get more realistic human/bot interaction data/feedback.

Jun 16 '23 19:06 yuechen-li-dev

Message categorization is also useful for more strict quality ratings since it would mean we could direct people who are more knowledgeable about a given subject to label and respond to those messages.

Jun 17 '23 00:06 someone13574

Meeting notes

Inference System:

Separate the inference from the website
uses its own auth system
Email may not be feasible.

Preprompt:

Language selection should be available, but the other features should not.
The effectiveness of the pre-prompt is not clear yet.
Investigate implementing a tagging system for completed dialogue trees
- use small model to automatically tag trees?

Review:

We have an issue with repetitive or low-quality responses because people copy and paste from ChatGPT.
Implement a new task to propose edits to remove data pollution.
Consider utilizing domain-specific prompts and inputs to leverage user expertise.
Consider making the Open Assistant data collection system available to specific groups of experts (as an isolated system in that domain)
- universities or companies as an example
- being visible and known makes you less likely to spam

Data quality English:

Maybe stop collecting new data for English, since the quality is going down, and again, consider tasks for editing and improving existing data.

Jun 20 '23 19:06 AbdBarho
