Open-Assistant

Training data from Twitter

dhruv2601 opened this issue 1 year ago • 17 comments

Twitter is a good source for gathering multi-turn conversation training data. One approach is to convert any thread into A: B: A: format by linking user_id to the tweets. It has worked well in my experience.
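
A minimal sketch of what that thread-to-turns conversion could look like (the tweet field names here are illustrative assumptions, not actual project code):

```python
from typing import Dict, List

def thread_to_turns(thread: List[Dict]) -> str:
    """Turn an ordered tweet thread into alternating A:/B: turns.

    Each tweet is assumed to carry "user_id" and "text"; the thread's first
    author is labelled A and every other participant B.
    """
    if not thread:
        return ""
    first_author = thread[0]["user_id"]
    lines = []
    for tweet in thread:
        speaker = "A" if tweet["user_id"] == first_author else "B"
        lines.append(f"{speaker}: {tweet['text']}")
    return "\n".join(lines)

# Example: a two-user exchange collapses to A: / B: / A:
thread = [
    {"user_id": 1, "text": "How do I sort a dict by value in Python?"},
    {"user_id": 2, "text": "Use sorted(d.items(), key=lambda kv: kv[1])."},
    {"user_id": 1, "text": "Thanks, that works!"},
]
print(thread_to_turns(thread))
```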

I can prepare a plan and notebooks and share sample results (a few thousand for a start) showing what the data would look like, and further assess quality and toxicity metrics on random samples. The data can be taken freely from the twitterstream archives.

@yk what do you think?

Tasks:

  • [x] Evaluation of legality / usefulness / scope (discussion in this issue)
  • [x] Raw data set is published (or just a link to the data if it already exists)
  • [x] #620
  • [ ] Code run and data published to OpenAssistant Hugging Face

dhruv2601 avatar Dec 28 '22 19:12 dhruv2601

I would love to see this plan as someone trying to educate themselves.

degenai avatar Dec 29 '22 08:12 degenai

Would love to collaborate here. Lmk if there is a task that can be split.

tusharagarwal25 avatar Dec 29 '22 11:12 tusharagarwal25

this looks like a neat plan. I think this would be really supercharged if we had some sort of "instruction/request detector", i.e. a model (or heuristic) that detects when the initial tweet is an instruction. I think this would reduce data noise by a huge amount. do you think that's a good plan?

I've created #143 for this. Do you want to work on that or do you already want to start here?

yk avatar Dec 29 '22 12:12 yk

@yk I think that's a good idea; a DeBERTa-like model would be great for this task. It adapts quickly to varied styles, letting us run a single model across different social media (where conversation style differs) as well as normal web-style text. Further, a task like instruction classification would need something like 10-15k high-quality samples, and there is enough instruction-based data out there to get good positive and negative examples.

To be fair, a heuristic-based approach would be much quicker at inference time, so data collection throughput would increase greatly given the huge amount of data we need. For now, though, I will keep it as an offline experiment to see how a few heuristics perform on a task like this and whether there are certain patterns we could use.
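
As a rough illustration (not the project's actual model), such a DeBERTa-based detector could start from a stock checkpoint via Hugging Face transformers; the model name, label scheme, and the labelled fine-tuning data are all assumptions here:

```python
# Sketch only: a DeBERTa-v3 binary classifier head for "is this tweet an
# instruction/request?". Fine-tuning on the ~10-15k labelled samples
# mentioned above would happen before this is actually useful.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2  # 0 = not an instruction, 1 = instruction
)

def is_instruction(text: str) -> bool:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```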

I would like to start working on #143 first, you could assign it to me. I will propose a more detailed plan for that task here or on discord DM tonight.

dhruv2601 avatar Dec 29 '22 13:12 dhruv2601

amazing, thank you!

yk avatar Dec 29 '22 14:12 yk

you need to comment over there, I can't assign you otherwise 😄

yk avatar Dec 29 '22 14:12 yk

Done 👯 I'll circle back to this current issue of Twitter collection when done with 143.

dhruv2601 avatar Dec 29 '22 15:12 dhruv2601

Hi all, I have some experience with Twitter data science projects (I had to load, filter, and process a TB of tweets). Finding suitable threads may be difficult, and running a model over everything to check could be time-consuming given the vast amount of raw data. It might be possible to reduce the search space by filtering for certain hashtags or keywords like "help", "lazyweb", or other suitable ones. At the very least, it might help to get initial training data for the instruction model. Just an idea. I'm willing to help if I can.

Jmete avatar Dec 30 '22 12:12 Jmete

that's true. at least we could have multi-stage filters that optimize for recall, the first ones being very quick. Would you like to set up some code that would allow us to collect twitter data? We can then plug in the instruction detector once we have it.
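
A sketch of what the cheap first stage of such a recall-oriented filter chain could look like; the marker list and tweet fields are illustrative assumptions, with the model-based instruction detector pluggable as a later stage:

```python
# Stage 1: a very cheap keyword/hashtag pass that keeps anything loosely
# request-like (high recall, low precision). Stage 2 (the instruction
# detector) can be plugged in once it exists.
HELP_MARKERS = ("#help", "#lazyweb", "how do i", "can anyone", "any advice")

def cheap_filter(tweet: dict) -> bool:
    text = tweet.get("text", "").lower()
    return any(marker in text for marker in HELP_MARKERS)

def collect(tweets, instruction_detector=None):
    candidates = (t for t in tweets if cheap_filter(t))
    if instruction_detector is None:
        return list(candidates)  # stage 2 not available yet
    return [t for t in candidates if instruction_detector(t["text"])]
```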

yk avatar Dec 30 '22 21:12 yk

Sure, I'll work on some initial code! I reviewed the data structures file, but I might need to follow up on the best way to store the conversation threads, for example as JSON, in a relational DB with parent ids, or as big CSV files. The archive links store them as .tar files containing compressed JSON files (.json.gz). Any feedback from experienced devs on optimizing this is always welcome.

Jmete avatar Dec 31 '22 12:12 Jmete

I think we can stay with gzipped json-line files for now, they're very versatile, and we can process them into other formats easily.
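
For reference, writing and reading gzipped JSON-lines files (one record per line) needs only the standard library; a minimal sketch:

```python
import gzip
import json

def write_jsonl_gz(path, records):
    # one JSON object per line, gzip-compressed
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl_gz(path):
    # stream the records back without loading the whole file
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```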

yk avatar Dec 31 '22 15:12 yk

Hi @yk and others, I’ve been working on the twitter data. Some initial findings:

  • I downloaded the first archive as a test. It's some tar files consisting of 8640 json.gz files.

  • Each json file has around 2000 rows of data in it. After some pre-filtering to keep only rows that have replies or are a reply to something else, and have non-truncated text, it's about 500 rows per file. This seems fairly consistent so far, in this dump at least, and would definitely drop further if we used stricter filters like keywords and hashtags.

  • The tweet data isn’t very deep. It will mention that this tweet has replies, or which tweet it is a reply to, but it doesn’t seem to carry the whole thread. We need to create the conversation threads by piecing them together and matching ids from the bottom up.

  • There is no guarantee that an original tweet and its responses are in the same json.gz file.

  • Due to this, we might need to process many of them together as unified files in order to match them. Right now I am doing some pre-filtering and then storing the results in larger parquet files to save on the IO cost of opening lots of little files (see the sketch after this list). It doesn't have to be parquet, but it works well as a temporary solution and the file sizes are reasonable: every 512 processed files comes to between 30-50 MB, covering around 250k rows of tweets.

  • Alternatively, we could highlight original tweets (the instruction / prompt) that have replies (possibly with some extra filters like keywords, non-truncated text, etc.), and then use those tweet IDs to extract the entire thread via external APIs instead of trying to mine it from the existing dump. For now I've written scripts that create the easier-to-digest pre-filtered parquet files by looping through the file list, processing each file, and exporting the combined files.
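
A sketch of that pre-filtering step, assuming pandas plus a parquet engine such as pyarrow; the field names follow the usual Twitter v1.1 JSON schema, and the filter logic, paths, and chunk size are placeholders rather than the actual scripts:

```python
import glob
import gzip
import json
from pathlib import Path

import pandas as pd  # parquet output also needs pyarrow or fastparquet

COLUMNS = ["id_str", "in_reply_to_status_id_str", "user_id", "text", "lang", "created_at"]

def load_filtered(path: str) -> pd.DataFrame:
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            t = json.loads(line)
            if "id_str" not in t:    # stream dumps interleave delete notices etc.
                continue
            if t.get("truncated"):   # keep only non-truncated text
                continue
            # keep rows that are a reply, or that have replies (reply_count may
            # not be populated in every dump)
            if not (t.get("in_reply_to_status_id_str") or t.get("reply_count", 0) > 0):
                continue
            rows.append({
                "id_str": t["id_str"],
                "in_reply_to_status_id_str": t.get("in_reply_to_status_id_str"),
                "user_id": t.get("user", {}).get("id_str"),
                "text": t.get("text", ""),
                "lang": t.get("lang"),
                "created_at": t.get("created_at"),
            })
    return pd.DataFrame(rows, columns=COLUMNS)

# Combine every 512 small files into one parquet chunk to cut down on IO.
Path("prefiltered").mkdir(exist_ok=True)
files = sorted(glob.glob("archive/**/*.json.gz", recursive=True))
for i in range(0, len(files), 512):
    chunk = pd.concat([load_filtered(p) for p in files[i:i + 512]], ignore_index=True)
    chunk.to_parquet(f"prefiltered/tweets_{i // 512:04d}.parquet")
```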

Action items:

  • Decide on how to create the conversation threads from this raw data. We can also determine the final output file format too.
  • Decide on the filters to use. Other potential filters include certain hashtags, keywords, follower count of the people tweeting, language (English or all?), etc.
  • Do we have any storage location to store pre-processed files? For example, someone can run the scripts to create the parquet (or other format files), and then someone else can take those files and do extra processing on them.

Jmete avatar Dec 31 '22 22:12 Jmete

hey this is really nice work so far, thank you! with respect to how to get the conversations, I suggest you do what you feel is the most appropriate. seems like the choice is between a) keeping lots of stuff in RAM to build the threads, b) spinning up a local db like redis to do that, or c) querying the API.

@lewtun do you have inputs on output format and storage?

as for keywords or hashtags, we might need to investigate this some more. one strategy is to search for hashtags or keywords that are commonly used when someone is requesting help with something, like #help. On the other hand, we could search for topical keywords such as math, sports, etc. and rely on the instruction detector to pick out the good instruction data. What do you think?

yk avatar Jan 01 '23 16:01 yk

Hi @yk @lewtun and others interested, I seem to have gotten the code working to turn the twitter dumps into conversation trees. Copying this post from discord. I have a jsonl file to share if anyone wants to look at it, can DM on discord.

More details: Step 1: I downloaded a few archive dumps and wrote code that extracts a set of standardized columns into parquet files for non-truncated tweets. This acts as a standardized dump that is easier to process. There are about 90M tweets.

Step 2: I processed those parquet files into large dataframes and did some matching to figure out which tweets were origin tweets (just looking at English for now). Then I wrote code to loop through the origin tweets and extract each conversation tree from its origin node, using a general tree and node class structure. Some elements are still missing, like metadata for the full tree, and I haven't merged a prompter's replies to their own tweets yet; they will show up as children, but they are identified by their role as prompter or assistant.

Step 3: I exported that list of conversation trees into a jsonl file. There are 17,378 lines; each is a tree.
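
For illustration, a tree/node structure along the lines described above might look like the following; field names, the role assignment, and the export shape are assumptions, not the actual notebook code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    tweet_id: str
    user_id: str
    text: str
    role: str                               # "prompter" (origin author) or "assistant"
    children: List["Node"] = field(default_factory=list)

def build_tree(origin: dict, by_parent: Dict[str, list]) -> Node:
    """Attach replies to the origin tweet; `by_parent` maps
    in_reply_to_status_id -> [reply tweets]."""
    origin_user = origin["user_id"]

    def make(tweet: dict) -> Node:
        role = "prompter" if tweet["user_id"] == origin_user else "assistant"
        node = Node(tweet["id_str"], tweet["user_id"], tweet["text"], role)
        for reply in by_parent.get(tweet["id_str"], []):
            node.children.append(make(reply))
        return node

    return make(origin)

def to_dict(node: Node) -> dict:
    """Serialize a tree for the one-tree-per-line jsonl export."""
    return {"role": node.role, "text": node.text,
            "children": [to_dict(c) for c in node.children]}
```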

Notes:

  • This is just based on data that I extracted from the dump, not all possible tweets.
  • I feel a lot of the tweets are low quality / spam. We could run a future instruction model on them to try to determine quality, I guess. I did try to explore and check for helpful hashtags or words, but they are either messy or too rare in the dataset, even over millions of tweets.
  • We might need to try a separate scrape directly against Twitter, searching for the types of tweets / hashtags we want, to get higher quality.
  • I will try to clean up my code and update my github fork. Currently just running on local notebooks.

Jmete avatar Jan 07 '23 18:01 Jmete

thanks a lot for this update, this sounds very cool! don't worry about getting the data format exactly correct, having the data in any way is already great!

yk avatar Jan 07 '23 21:01 yk

Any update here?

andreaskoepf avatar May 05 '23 10:05 andreaskoepf

Hi @andreaskoepf , Apologies, I have been busy with work for the past month or so. I ran into some problems getting quality prompts out of it, and there are issues with recent changes to how Twitter operates. I can soon make some commits for extra scripts I have written to clean the data and apply a detector to help remove some of the spam. We have 2 approaches:

  • Scraping archive files. This has a lot of data, but a lot of spam. The original commits focused on this.
  • Scraping rolled-up Twitter threads (from a third party), which would then need to be run through a separate process to generate questions from the paragraphs to form proper Q/A or instruct pairs.

After that, I would suggest we either set it aside for later or let someone else take up the mantle of cleaning it up.

Jmete avatar May 05 '23 11:05 Jmete

Closing old data issue that has not been completed by now.

andreaskoepf avatar Jun 14 '23 08:06 andreaskoepf