Open-Assistant
Added files for scraping threads, and filtering original archive tweets
Added scripts #126:
- Creating clean threads by scraping and processing "rolled" threads, where people unroll a thread using a third-party service. These are often higher-quality tweet threads. However, they are not immediately ready for use, since we will need another LLM to create QA pairs or instruct pairs from them.
- Added a script to filter the original tweet threads from the archive data using a custom model I trained to help remove spam and detect instructions. It is not perfect, and it has not yet filtered the children of the trees, since it does not detect replies.
- Other general fixes to the README and other elements.
As part of this PR, can we maybe move scripts/data-collection/twitter to a directory in data/datasets? Really, the scripts/data-collection and notebooks directories are legacy.