Switch to filtered prosocial dataset

TCLProject opened this issue 2 years ago · 4 comments

This PR modifies ProsocialDialogue to use the filtered version of the dataset I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the largely irrelevant lines where the safety label is "casual" or "possibly/probably needs caution", which I found added little value, as well as some lines where the phrasing of the response might hurt the model's performance by refusing to act on a request.
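
For reference, a minimal sketch of this kind of filtering with the HF datasets library; the label strings and refusal phrases below are illustrative assumptions rather than the exact filters used:

```python
from datasets import load_dataset

# Safety labels to drop (assumed label strings; check the dataset card).
DROP_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}

# Refusal-style phrasings to drop (illustrative; not the actual word list used).
REFUSAL_PHRASES = ("I can't help with", "I cannot assist")

def keep(row):
    # Drop rows with low-signal safety labels or refusal-style responses.
    if row["safety_label"] in DROP_LABELS:
        return False
    return not any(p in row["response"] for p in REFUSAL_PHRASES)

train = load_dataset("allenai/prosocial-dialog", split="train")
train.filter(keep).to_json("train.json")
```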

This is an alternative solution that may work instead of removing the dataset completely, as mentioned in #3144.

TCLProject avatar May 14 '23 19:05 TCLProject

Relevant #3144

olliestanley avatar May 14 '23 19:05 olliestanley

Thanks a lot for working on prosocial. We got some negative comments on SFT-8 (not yet deployed), which used 15% of prosocial-dialog and an unfiltered version of gpt4all. The discussion of the dataset mixture for SFT-9 is still very much at the beginning.

andreaskoepf avatar May 15 '23 07:05 andreaskoepf

(@TCLProject if you want to help us determine the OA SFT-9 dataset mix, please contact Ollie or me via DM on Discord; almostEvil___ is coordinating the SFT-9 project.)

andreaskoepf avatar May 15 '23 07:05 andreaskoepf

Thanks!

echo0x22 avatar May 15 '23 16:05 echo0x22

One point of confusion from me: the new filtered dataset is 221 MB and 3013 pages of rows on the HF viewer, while the original is much smaller at only 117 MB and 1203 pages. Could there be a duplication issue? It would also be really good if you could include your filtering code in a directory under data/datasets/ in this repo, if possible.

olliestanley avatar May 16 '23 09:05 olliestanley

I apologize for the late reply. As for the code: it was done through the command line and I've cleared my bash history since then, but it was a couple of grep commands I could probably come close to replicating (matching the exact filtered words).

As for the duplicates: this is a point of confusion for me too, believe it or not. I appear to have misunderstood how HF datasets work. The train.json in the original dataset is roughly 85 MB, while the train.json in the filtered dataset I uploaded (which I thought was the only file that mattered) is roughly 40 MB. HF seems to have combined it with all the other JSON files, which was not the intended behavior, and I do not understand why it acted that way (I apologize).

The other JSON files are there to preserve the dataset at different stages of filtering (e.g. with "possibly" but no "casual", with "probably" but no word filtering, etc.). I did not intend for all of the files to be merged together and would appreciate a pointer on why that happened and how I can prevent it.

TCLProject avatar Jun 08 '23 20:06 TCLProject

If you upload multiple JSON files, loading the HF dataset will by default combine them all unless a specific file is passed as an argument to the load_dataset() call. I made a change in a follow-up PR to add this argument, so it's no longer an issue.
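
For illustration, a minimal sketch of the difference (the repo id and file name here are placeholders, not the actual dataset):

```python
from datasets import load_dataset

# Without data_files, every JSON file in the repo is merged into one split.
merged = load_dataset("user/filtered-prosocial-dialog")

# Passing data_files restricts loading to the intended file only.
train_only = load_dataset(
    "user/filtered-prosocial-dialog",
    data_files="train.json",
    split="train",
)
```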

olliestanley avatar Jun 08 '23 20:06 olliestanley

Good to know, thank you!

TCLProject avatar Jun 08 '23 20:06 TCLProject