Open-Assistant
harvest Stack Exchange Q&A data
https://stackexchange.com/sites
After reading a discussion here about how to build a model that detects "instruction-like" conversations on Twitter, I was wondering about data sources that are in instruction format by definition. I found no discussion of Stack Exchange in this repo, so this thread is to explore the possibility of using data from the numerous Stack Exchange sites.
good idea, could be neat to get data from there. keep in mind that the main goal is task diversity, so we'd need to focus on that when scraping any external source, but SE in itself is already very diverse, so pretty promising.
before scraping, there are some online dumps of this that might prove useful. here is an example: https://www.kaggle.com/datasets/stackoverflow/stackoverflow?select=posts_answers
Also have a look at https://data.stackexchange.com before scraping.
Specifically, what information do we want from the StackExchange websites? I assume at least:
- question title
- question body
- question votes
- list of answers, each with:
  - answer body
  - answer votes
Are there other pieces of information that should be collected? Do we also want comments on the questions/answers?
Is the idea that OpenAssistant would be trained to produce the "answer body" conditioned on the "question title" + "question body"?
Perhaps the answers' votes could be used to train the reward model.
I'm curious to hear someone else's thoughts.
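For concreteness, a rough sketch of what one collected record could look like. The field names are just a suggestion, not a settled schema, and comments are included as an optional extra per the question above:

```python
from dataclasses import dataclass, field

@dataclass
class Answer:
    body: str   # answer text (HTML in the raw dumps)
    votes: int  # net answer score
    comments: list[str] = field(default_factory=list)  # optional, if we want them

@dataclass
class Question:
    title: str  # question title
    body: str   # question body
    votes: int  # net question score
    answers: list[Answer] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)  # optional, if we want them
```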
any thought given to only extracting accepted answers? Or just do what The Pile did and run jusText on the HTML?
I might be able to jump on this. Anyone know how aggressive SE is about blocking crawlers? Otherwise I can take a look at how much coverage is in Common Crawl.
The Pile also has a fairly large slice that's from SE. Do we think it'll be beneficial to have more of it in the training data? Anyway, it's probably a good idea to look at how the SE data is formatted in The Pile.
https://github.com/EleutherAI/stackexchange-dataset, plus it's already anonymized with PII removed.
I currently have a dataset I was putting together for this type of usage: 74 different Stack Exchange sources, with post text, top-rated answers, ratings, and extracted keywords. It's about 592k rows (152k questions, 440k answers), prefiltered to only questions with an accepted answer and above a certain score, and I have it in Parquet and JSONL with the question and all its answers on each line.
JSONL looks like this
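An illustrative stand-in for one decoded row (all field names are guesses based on the description above, not the dataset's actual keys):

```python
# one JSONL row after json.loads(); field names are illustrative only
row = {
    "source": "stats.stackexchange.com",
    "title": "What is the difference between L1 and L2 regularization?",
    "question": "I keep seeing both terms used and ...",
    "question_score": 42,
    "keywords": ["regularization", "lasso", "ridge"],
    "answers": [
        {"text": "L1 penalizes the absolute value of ...", "score": 57},
        {"text": "Think of it geometrically: ...", "score": 12},
    ],
}
```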

Stack Exchange already uploads its dumps:
https://archive.org/details/stackexchange
That covers 367 sub-stacks.
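Each site in those dumps ships as a 7z archive containing a `Posts.xml`, where questions (`PostTypeId="1"`) and answers (`PostTypeId="2"`) are `row` elements. A minimal streaming-parse sketch so the bigger files never have to fit in memory (the attribute names follow the published dump schema; everything else is illustrative):

```python
import xml.etree.ElementTree as ET

def iter_posts(path: str):
    """Stream question and answer rows from a Stack Exchange Posts.xml dump."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "row":
            continue
        if elem.get("PostTypeId") == "1":    # question
            yield {
                "id": elem.get("Id"),
                "title": elem.get("Title"),
                "body": elem.get("Body"),    # HTML
                "score": int(elem.get("Score", "0")),
                "accepted_answer_id": elem.get("AcceptedAnswerId"),
            }
        elif elem.get("PostTypeId") == "2":  # answer
            yield {
                "id": elem.get("Id"),
                "parent_id": elem.get("ParentId"),  # the question it answers
                "body": elem.get("Body"),
                "score": int(elem.get("Score", "0")),
            }
        elem.clear()  # drop the element's contents once processed
```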
Also of similar use is this Reddit dump: https://files.pushshift.io/reddit/
Anthropic used this exact data to pre-train their RLHF models: https://arxiv.org/abs/2112.00861
The above paper covers ablations on the best way to pre-train for downstream tasks using such data, finding that additional language model pre-training isn't useful and that binary discrimination should be preferred instead.
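To make the binary-discrimination idea concrete: pair a higher-voted answer with a lower-voted one to the same question and train a scalar scorer to prefer the former. A minimal sketch of the pairwise loss (my own illustration in PyTorch, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_good: torch.Tensor, score_bad: torch.Tensor) -> torch.Tensor:
    """Pairwise binary discrimination: maximize the probability that the
    better (e.g. higher-voted) answer gets the higher scalar score."""
    return -F.logsigmoid(score_good - score_bad).mean()
```

Pairs could come straight from SE metadata, e.g. the accepted answer versus a low-voted answer on the same question.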
thanks @batelicm this is very valuable!
Hey all, happy to work on this issue. I can probably help with the Pile on the data side. I'll dig into it a bit more and look for the things we've already talked about wanting to extract.
Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
> Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
@lewtun do we already have that?
also, have a look at https://github.com/LAION-AI/Open-Assistant/pull/288 (not merged yet)
Added a notebook #355 (not merged yet) to ingest Stack Exchange from the data dumps. At the end I added the Open-Assistant Data Scheme from #288
https://download.kiwix.org/zim/stack_exchange/
This has a ton of Stack Exchange sites. They are in ZIM format, which works with Kiwix.
https://www.kiwix.org/
The 7z's linked above by @batelicm might work better, but this could serve as a backup if that doesn't pan out.
Also note you can go up a folder level to get access to more ZIM files from all over the place, beyond Stack Exchange, including sources such as Gutenberg, Wikibooks, and more.
https://download.kiwix.org/zim/
Another source to watch is Academic Torrents. It could help with sourcing more data.
@smytjf11 re: the Pile, there are actually a lot of good things we can extract from there: mathematics instructions, coding instructions, patent title-to-summary and summary-to-title (you could simulate a conversation about writing a patent by parsing the summaries into sentences; they are often written as the main parts of the patents). ping in the Discord and we can discuss more.
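A rough sketch of the patent title↔summary idea (prompt wording and field names are placeholders):

```python
def patent_to_pairs(title: str, summary: str) -> list[dict]:
    """Turn one patent record into two instruction->response examples,
    one per direction."""
    return [
        {"instruction": f"Write a summary for a patent titled: {title}",
         "response": summary},
        {"instruction": f"Suggest a title for a patent with this summary: {summary}",
         "response": title},
    ]
```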
> Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
we don't have a format yet, and at this stage it's more important to gather the data. my format has been [instruction -> response, instruction -> response, ...]. if possible, we would also like at the end something like [instruction -> [(response1, score1), (response2, score2), ...]].
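In Python terms, the two shapes described above would be roughly (illustrative only, since there is no agreed schema yet):

```python
# shape 1: a conversation as instruction -> response turns
conversation = [
    ("instruction 1", "response 1"),
    ("instruction 2", "response 2"),
]

# shape 2: a final instruction with several scored candidate responses,
# e.g. scored by SE votes, useful for reward modelling
final_turn = ("instruction 3", [("response A", 57), ("response B", 12)])
```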
> Added a notebook #355 (not merged yet) to ingest Stack Exchange from the data dumps. At the end I added the Open-Assistant Data Scheme from #288
cc @Vechtomov. @b-mc2 have we exported stack exchange into dialog format yet on hf?
Do you think that using the score associated with each answer as a reward signal for RL would be a good idea? Or is it better to use this data for finetuning?
(Hello, first post here, great project!)
Howdy, first-time contributor here. I've done some scraping and data manipulation before and I'm pretty decent with Python. Somebody please tell me how I can begin. Thanks!
Hey @Free-Radical. It looks like a notebook to convert StackExchange data to Open Assistant format was already created by @b-mc2 and merged, see here: https://github.com/LAION-AI/Open-Assistant/pull/355.
I think the next step would be to discuss whether there are benefits in further massaging the data; then it will need to be converted into Parquet files and uploaded to HF (see https://projects.laion.ai/Open-Assistant/docs/data/datasets).
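The conversion step itself is small; a sketch using pandas and `datasets` (file name and repo id are placeholders):

```python
import pandas as pd
from datasets import Dataset

# placeholder input: one conversation per line in Open-Assistant format
df = pd.read_json("stackexchange_oa.jsonl", lines=True)
df.to_parquet("stackexchange_oa.parquet")  # local parquet copy

# or push straight to the Hugging Face Hub (placeholder repo id)
Dataset.from_pandas(df).push_to_hub("your-org/stackexchange-instructions")
```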
cool, please tell me the next steps, or if you think you can use me elsewhere, let me know. Also, how do you guys do DMs when necessary? PS: where could I see an overview of the whole project?
I gave it a go and after some fights with 100GB XML files I have prepared a PR https://github.com/LAION-AI/Open-Assistant/pull/2848