Open-Assistant
harvest Stack Exchange Q&A data
https://stackexchange.com/sites
After reading a discussion here about how to build a model that detects "instruction-like" conversations on Twitter, I was wondering about data sources that are in instruction format by definition. I found no discussion of Stack Exchange in this repo, so this thread is to explore the possibility of using data from the numerous Stack Exchange sites.
good idea, could be neat to get data from there. keep in mind that the main goal is task diversity, so we'd need to focus on that when scraping any external source, but SE in itself is already very diverse, so pretty promising.
before scraping, there are some online dumps of this that might prove useful. here is an example: https://www.kaggle.com/datasets/stackoverflow/stackoverflow?select=posts_answers
Also have a look at https://data.stackexchange.com before scraping.
Specifically, what information do we want from the StackExchange websites? I assume at least:
- question title
- question body
- question votes
- list of answers, each with:
  - answer body
  - answer votes
Are there other pieces of information that should be collected? Do we also want comments on the questions/answers?
Is the idea that OpenAssistant would be trained to produce the "answer body" conditioned on the "question title" + "question body"?
Perhaps the answers' votes could be used to train the reward model.
I'm curious to hear someone else's thoughts.
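For concreteness, a rough sketch of what one collected record could look like. The field names are just a suggestion, not a settled schema, and comments are included as an optional extra per the question above:

```python
from dataclasses import dataclass, field

@dataclass
class Answer:
    body: str   # answer text (HTML in the raw dumps)
    votes: int  # net answer score
    comments: list[str] = field(default_factory=list)  # optional, if we want them

@dataclass
class Question:
    title: str  # question title
    body: str   # question body
    votes: int  # net question score
    answers: list[Answer] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)  # optional, if we want them
```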
any thought given to only extracting accepted answers? Or just do what The Pile did and run jusText on the HTML?
I might be able to jump on this. Anyone know how aggressive SE is about blocking crawlers? Otherwise I can take a look at how much coverage is in Common Crawl.
The Pile also has a fairly large slice that's from SE. Do we think it'll be beneficial to have more of it in the training data? Anyway, it's probably a good idea to look at how the SE data is formatted in The Pile.
https://github.com/EleutherAI/stackexchange-dataset, plus it's already anonymized with PII removed.
I currently have a dataset I was putting together for this type of usage: 74 different Stack Exchange sources, with post text, top-rated answers, ratings, and extracted keywords. It's about 592k rows (152k questions, 440k answers), prefiltered to only questions with an accepted answer and above a certain score, and I have it in Parquet and JSONL with the question and all its answers on each line.
JSONL looks like this
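An illustrative stand-in for one decoded row (all field names are guesses based on the description above, not the dataset's actual keys):

```python
# one JSONL row after json.loads(); field names are illustrative only
row = {
    "source": "stats.stackexchange.com",
    "title": "What is the difference between L1 and L2 regularization?",
    "question": "I keep seeing both terms used and ...",
    "question_score": 42,
    "keywords": ["regularization", "lasso", "ridge"],
    "answers": [
        {"text": "L1 penalizes the absolute value of ...", "score": 57},
        {"text": "Think of it geometrically: ...", "score": 12},
    ],
}
```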

Stack Exchange already uploads its dumps:
https://archive.org/details/stackexchange
That covers 367 sub-stacks.
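Each site in those dumps ships as a 7z archive containing a `Posts.xml`, where questions (`PostTypeId="1"`) and answers (`PostTypeId="2"`) are `row` elements. A minimal streaming-parse sketch so the bigger files never have to fit in memory (the attribute names follow the published dump schema; everything else is illustrative):

```python
import xml.etree.ElementTree as ET

def iter_posts(path: str):
    """Stream question and answer rows from a Stack Exchange Posts.xml dump."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "row":
            continue
        if elem.get("PostTypeId") == "1":    # question
            yield {
                "id": elem.get("Id"),
                "title": elem.get("Title"),
                "body": elem.get("Body"),    # HTML
                "score": int(elem.get("Score", "0")),
                "accepted_answer_id": elem.get("AcceptedAnswerId"),
            }
        elif elem.get("PostTypeId") == "2":  # answer
            yield {
                "id": elem.get("Id"),
                "parent_id": elem.get("ParentId"),  # the question it answers
                "body": elem.get("Body"),
                "score": int(elem.get("Score", "0")),
            }
        elem.clear()  # drop the element's contents once processed
```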
Also of similar use is this Reddit dump: https://files.pushshift.io/reddit/
Anthropic used this exact data to pre-train their RLHF models: https://arxiv.org/abs/2112.00861
The above paper covers ablations on the best way to pre-train for downstream tasks using such data, finding that additional language model pre-training isn't useful and that binary discrimination should be preferred instead.
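To make the binary-discrimination idea concrete: pair a higher-voted answer with a lower-voted one to the same question and train a scalar scorer to prefer the former. A minimal sketch of the pairwise loss (my own illustration in PyTorch, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_good: torch.Tensor, score_bad: torch.Tensor) -> torch.Tensor:
    """Pairwise binary discrimination: maximize the probability that the
    better (e.g. higher-voted) answer gets the higher scalar score."""
    return -F.logsigmoid(score_good - score_bad).mean()
```

Pairs could come straight from SE metadata, e.g. the accepted answer versus a low-voted answer on the same question.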
thanks @batelicm this is very valuable!
Hey all, happy to work on this issue. I can probably help with the Pile on the data side. I'll dig into it a bit more and look for the things we've already talked about wanting to extract.
Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
> Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
@lewtun do we already have that?
also, have a look at https://github.com/LAION-AI/Open-Assistant/pull/288 (not merged yet)
Added a notebook #355 (not merged yet) to ingest Stack Exchange from the data dumps. At the end I added the Open-Assistant Data Scheme from #288
https://download.kiwix.org/zim/stack_exchange/
This has a ton of Stack Exchange sites. They are in ZIM format, which works with Kiwix.
https://www.kiwix.org/
The 7z's linked above by @batelicm might work better, but this could serve as a backup if that doesn't pan out.
Also note you can go up a folder level to get access to more ZIM files from all over the place, beyond Stack Exchange, including sources such as Gutenberg, Wikibooks, and more.
https://download.kiwix.org/zim/
Another source to watch is Academic Torrents. It could help with sourcing more data.
@smytjf11 re: the Pile, there are actually a lot of good things we can extract from there: mathematics instructions, coding instructions, patent title-to-summary and summary-to-title (you could simulate a conversation about writing a patent by parsing the summaries into sentences; they are often written as the main parts of the patents). ping in the Discord and we can discuss more.
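A rough sketch of the patent title↔summary idea (prompt wording and field names are placeholders):

```python
def patent_to_pairs(title: str, summary: str) -> list[dict]:
    """Turn one patent record into two instruction->response examples,
    one per direction."""
    return [
        {"instruction": f"Write a summary for a patent titled: {title}",
         "response": summary},
        {"instruction": f"Suggest a title for a patent with this summary: {summary}",
         "response": title},
    ]
```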
> Is there some kind of "getting started" guide on how to make data sources like this compatible and useful for the project?
we don't have a format yet, and at this stage it's more important to gather the data. my format has been [instruction -> response, instruction -> response, ...]. if possible, we would also like at the end something like [instruction -> [(response1, score1), (response2, score2), ...]].
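In Python terms, the two shapes described above would be roughly (illustrative only, since there is no agreed schema yet):

```python
# shape 1: a conversation as instruction -> response turns
conversation = [
    ("instruction 1", "response 1"),
    ("instruction 2", "response 2"),
]

# shape 2: a final instruction with several scored candidate responses,
# e.g. scored by SE votes, useful for reward modelling
final_turn = ("instruction 3", [("response A", 57), ("response B", 12)])
```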
> Added a notebook #355 (not merged yet) to ingest Stack Exchange from the data dumps. At the end I added the Open-Assistant Data Scheme from #288
cc @Vechtomov. @b-mc2 have we exported stack exchange into dialog format yet on hf?
Do you think that using the score associated with each answer as a reward signal for RL would be a good idea? Or is it better to use this data for finetuning?
(Hello, first post here, great project!)
Howdy, first-time contributor here. I've done some scraping and data manipulation before and I'm pretty decent with Python. Somebody please tell me how I can begin. Thanks!
Hey @Free-Radical. It looks like a notebook to convert StackExchange data to Open Assistant format was already created by @b-mc2 and merged, see here: https://github.com/LAION-AI/Open-Assistant/pull/355.
I think the next step would be to discuss whether there are benefits in further massaging the data; then it will need to be converted into Parquet files and uploaded to HF (see https://projects.laion.ai/Open-Assistant/docs/data/datasets).
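The conversion step itself is small; a sketch using pandas and `datasets` (file name and repo id are placeholders):

```python
import pandas as pd
from datasets import Dataset

# placeholder input: one conversation per line in Open-Assistant format
df = pd.read_json("stackexchange_oa.jsonl", lines=True)
df.to_parquet("stackexchange_oa.parquet")  # local parquet copy

# or push straight to the Hugging Face Hub (placeholder repo id)
Dataset.from_pandas(df).push_to_hub("your-org/stackexchange-instructions")
```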
cool, please tell me the next steps, or if you think you can use me elsewhere, let me know. Also, how do you guys do DMs when necessary? PS: where could I see an overview of the whole project?
I gave it a go and after some fights with 100GB XML files I have prepared a PR https://github.com/LAION-AI/Open-Assistant/pull/2848