Open-Assistant
OA Developer Meeting
Last meeting #3321
- spam, bots, and data quality for inference and RLHF
- found this old issue #914
High Priority:
- System prompt prefix and initial prompt categorization tasks: should include language, task categorization, and other tags. An example would be:
<|system|> lang:en, task:coding, tag:python <|prompter|> ...
- Review system design to clean up existing data: should include an edit proposal and annotation system. In the works: https://github.com/LAION-AI/Open-Assistant/pull/3289
- Pause English data collection once the review system is implemented, to focus on reviewing against a static, non-moving target, as recent English data contributions contain too much spam. Release the data at the pause point as Oasst1.1.
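A minimal sketch of how the system prompt prefix proposed above could be built and parsed. The helper names `build_prefix` and `parse_prefix` are hypothetical, and the field layout (`lang:`, `task:`, repeated `tag:` entries separated by commas) is an assumption taken from the single example in these notes, not an agreed format.

```python
import re

SYSTEM, PROMPTER = "<|system|>", "<|prompter|>"

def build_prefix(lang, task, tags=()):
    """Build a prefix such as '<|system|> lang:en, task:coding, tag:python <|prompter|>'."""
    fields = [f"lang:{lang}", f"task:{task}"] + [f"tag:{t}" for t in tags]
    return f"{SYSTEM} {', '.join(fields)} {PROMPTER}"

def parse_prefix(text):
    """Return the metadata found in a leading system prefix, or None if there is none."""
    match = re.match(r"<\|system\|>\s*(.*?)\s*<\|prompter\|>", text, re.S)
    if match is None:
        return None
    meta = {"lang": None, "task": None, "tags": []}
    for field in match.group(1).split(","):
        key, sep, value = field.strip().partition(":")
        if not sep:
            continue  # skip anything that is not a key:value pair
        if key == "tag":
            meta["tags"].append(value.strip())
        elif key in ("lang", "task"):
            meta[key] = value.strip()
    return meta

prefix = build_prefix("en", "coding", ["python"])
meta = parse_prefix(prefix + " How do I reverse a list?")
```

Keeping the prefix in a flat `key:value` form means the same parser can later be used to filter or group messages by language or task.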
Medium Priority:
- A clear, consistent labeling guideline, as the previous RLHF results weren't ideal. Proposal for review: https://github.com/LAION-AI/Open-Assistant/issues/2893, can add a "potentially synthetic" tag as well.
- Design a regular dataset release cadence for the future. Maybe every two weeks?
- Liberapay/Open Collective setup for funding.
Low Priority:
- Dataset language localization: zh-hant and zh-hans conversion should be easy, as there are no grammar differences and non-LLM libraries can already do it efficiently.
- Set up a Lemmy instance for "Ask Open Assistant" as an alternative to Reddit to get more realistic human/bot interaction data/feedback.
Message categorization is also useful for stricter quality ratings, since it would let us direct people who are more knowledgeable about a given subject to label and respond to those messages.
Meeting notes
Inference System:
- Separate the inference from the website
- uses its own auth system
- Email may not be feasible.
Preprompt:
- Language selection should be available; other features should not be.
- The effectiveness of the pre-prompt is not clear yet.
- Investigate implementing a tagging system for completed dialogue trees
- use small model to automatically tag trees?
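As a rough sketch of the tree-tagging idea above: walk a completed dialogue tree, collect its messages, and hand the text to a tagger. The tree layout (`text` and `replies` keys) and the function names are hypothetical, and `keyword_tagger` is only a stand-in for whatever small model might be used.

```python
from typing import Callable, Iterator

def iter_messages(node: dict) -> Iterator[str]:
    """Yield the text of every message in the tree, depth first."""
    yield node["text"]
    for child in node.get("replies", []):
        yield from iter_messages(child)

def keyword_tagger(text: str) -> list:
    """Placeholder tagger; a small classification model could replace it."""
    rules = {
        "coding": ("python", "function", "bug"),
        "math": ("equation", "integral"),
    }
    lowered = text.lower()
    return [tag for tag, words in rules.items() if any(w in lowered for w in words)]

def tag_tree(tree: dict, tagger: Callable = keyword_tagger) -> list:
    """Tag a whole dialogue tree based on the concatenated text of its messages."""
    return sorted(set(tagger(" ".join(iter_messages(tree)))))

tree = {
    "text": "How do I write a Python function?",
    "replies": [{"text": "Use the def keyword.", "replies": []}],
}
tags = tag_tree(tree)
```

Passing the tagger in as a parameter keeps the tree traversal independent of the model choice, which is still open.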
Review:
- We have an issue with repetitive or low-quality responses because of copy-pasting from ChatGPT
- Implement a new task to propose edits to remove data pollution.
- Consider utilizing domain-specific prompts and inputs to leverage user expertise.
- Consider making the Open Assistant data collection system available to specific groups of experts (as an isolated system in that domain)
- universities or companies as an example
- being visible and known makes you less likely to spam
Data quality English:
- Maybe stop collecting new data for English, since the quality is going down, and again, consider tasks for editing and improving existing data.