
Added files for scraping threads and filtering original archive tweets

Open Jmete opened this issue 1 year ago • 1 comments

Added scripts #126:

  • Creating clean threads by scraping and processing "rolled" threads, where people unroll a thread using a third-party service. These are often higher-quality tweet threads. However, they are not immediately ready for use, since we will need another LLM to create QA pairs or instruct pairs from them.
  • Added a script to filter the original tweet threads from the archive data using a custom model I trained to help remove spam and detect instructions. It is not perfect, and it does not yet filter the children of the trees, since it does not detect replies.
  • Other general fixes to the README and other elements.
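
To illustrate the root-only filtering described in the second bullet, here is a minimal sketch. It is not the script from this PR: the tree shape (`text` plus `replies`), the function names, and the keyword heuristic standing in for the trained model are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical tree shape: {"text": str, "replies": [tree, ...]}
Tree = dict


def looks_like_instruction(text: str) -> bool:
    """Stand-in for the trained classifier (an assumption, not the PR's model)."""
    lowered = text.lower()
    # Crude spam signal: a link with almost no surrounding words.
    is_spam = "http" in lowered and len(lowered.split()) < 4
    # Crude instruction signal: a question or an imperative-style opener.
    is_prompt = text.rstrip().endswith("?") or lowered.startswith(
        ("how ", "explain ", "write ")
    )
    return is_prompt and not is_spam


def filter_roots(trees: list[Tree], keep: Callable[[str], bool]) -> list[Tree]:
    """Keep trees whose root text passes `keep`; replies are left untouched."""
    return [tree for tree in trees if keep(tree["text"])]


trees = [
    {
        "text": "How do I parse JSON in Python?",
        "replies": [{"text": "Use the json module.", "replies": []}],
    },
    {"text": "Win now http://spam", "replies": []},
    {"text": "Nice weather today", "replies": []},
]
kept = filter_roots(trees, looks_like_instruction)
```

Because only the root is scored, any reply under a kept root passes through unfiltered, which mirrors the limitation noted above.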

Jmete avatar May 05 '23 12:05 Jmete

As part of this PR, can we maybe move scripts/data-collection/twitter to a directory in data/datasets? Really, the scripts/data-collection and notebooks directories are legacy.

olliestanley avatar May 05 '23 13:05 olliestanley