Open-Assistant
                        Switch to filtered prosocial dataset
This PR modifies ProsocialDialogue to use a filtered version of the dataset that I have created, with less irrelevant data and fewer rejections.
In this modified dataset I have filtered out the lines where the safety label is "casual" or "possibly/probably needs caution", which I have found to be mostly irrelevant, as well as some lines where the response refuses to act on a request, since that phrasing might hurt the model's performance.
This is an alternative solution that may work instead of removing the dataset completely, as mentioned in #3144.
Relevant: #3144
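Roughly, the filtering corresponds to something like the sketch below (illustrative only; the field names, label strings, and refusal phrases shown are approximations of what was actually filtered, not the exact commands used):

```python
# Illustrative sketch of the ProsocialDialog filtering described above.
# Assumes a JSON-lines train.json where each record has "safety_label" and
# "response" fields; the exact label strings and phrase list are approximations.
import json

# Safety labels considered mostly irrelevant for SFT.
DROP_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}

# Example refusal phrasings to drop (the real list was longer and hand-picked).
DROP_PHRASES = ["i can't help", "i won't help"]

def keep(record: dict) -> bool:
    if record.get("safety_label") in DROP_LABELS:
        return False
    response = record.get("response", "").lower()
    return not any(phrase in response for phrase in DROP_PHRASES)

with open("train.json") as src, open("train.filtered.json", "w") as dst:
    for line in src:
        record = json.loads(line)
        if keep(record):
            dst.write(json.dumps(record) + "\n")
```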
Thanks a lot for working on prosocial. We got some negative comments for SFT-8 (not deployed yet), which used 15% of prosocial-dialog and an unfiltered version of gpt4all. The discussion of the dataset mixture for SFT-9 is still very much at the beginning.
(@TCLProject if you want to help us determine the OA SFT-9 dataset mix, please contact Ollie or me via DM on Discord .. almostEvil___ is coordinating the SFT-9 project.)
Thanks!
One point of confusion from me: the new filtered dataset is 221 MB and 3013 pages of rows on the HF viewer, while the original is much smaller, only 117 MB and 1203 pages. Could there be a duplication issue? It would also be really good if you could include your filtering code in a directory under data/datasets/ in this repo, if possible.
I apologize for the late reply. As for the code: it was done on the command line and I've cleared my bash history since then, but it used a couple of grep commands (with the exact filtered words) that I could probably get close to replicating.
As for the duplicates: this is a point of confusion for me too, believe it or not. I appear to have misunderstood how HF datasets work. The train.json in the original dataset is roughly 85 MB. The train.json (which I thought is what matters) is roughly 40 MB in the filtered dataset that I have uploaded. HF seems to have combined it with all the other JSON files, which is not the intended behavior, and I do not understand why it acted that way (I do apologize). The other JSON files are there to provide the dataset at different points of filtering (e.g. with "possibly" but no "casual", with "probably" but no word filtering, etc.). I did not intend for all of the files to be merged together and would appreciate a pointer on why that happened and how I can prevent it.
If you upload multiple JSONs, by default loading the HF dataset will combine them all unless a specific one is given as an argument to the load_dataset() call. I made a change in a follow-up PR to add this argument, so it's no longer an issue.
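For example, something along these lines (the dataset repo id here is just a placeholder):

```python
from datasets import load_dataset

# Load only the intended file instead of letting HF merge every JSON in the repo.
# The repo id below is a placeholder, not the actual dataset name.
ds = load_dataset(
    "your-user/filtered-prosocial-dialog",
    data_files="train.json",
    split="train",
)
print(len(ds))
```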
Good to know, thank you!