Open-Assistant
dataset: BigScience Biomedical Datasets
I was looking into adding some datasets from the BigBio repository, and I had some questions before proceeding.
- Galactica was trained on a subset of the BigBio corpus. I'm not sure which model the ML team has decided on, but if it is to be Galactica, should I omit these previously seen datasets?
- Galactica uses special tags for amino acid, DNA, and SMILES sequences. Should I also tag these entities, or would that get in the way if OA decides to go with a different representation scheme?
- BigBio is a collection of individual datasets. Should I keep these datasets as separate submissions to OA or should I bundle them all together?
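On the tagging question, here is a rough sketch of what I have in mind. It assumes the `[START_*]`/`[END_*]` token strings described in the Galactica paper; the helper name `tag_sequence` and the `scheme` parameter are my own and purely illustrative. Keeping the tags in a swappable mapping would let OA change or drop the scheme later without reprocessing the raw data.

```python
# Hypothetical helper, not part of BigBio or Open-Assistant.
# Tag strings follow the Galactica paper's convention as I understand it.
GALACTICA_TAGS = {
    "amino": ("[START_AMINO]", "[END_AMINO]"),
    "dna": ("[START_DNA]", "[END_DNA]"),
    "smiles": ("[START_SMILES]", "[END_SMILES]"),
}


def tag_sequence(sequence, kind, scheme=GALACTICA_TAGS):
    """Wrap a sequence in the start/end tags for its kind.

    Passing scheme=None returns the sequence untouched, so the same
    pipeline can emit untagged text if a different scheme is chosen.
    """
    if scheme is None:
        return sequence
    start, end = scheme[kind]
    return f"{start}{sequence}{end}"


if __name__ == "__main__":
    print(tag_sequence("ACGT", "dna"))
    print(tag_sequence("CCO", "smiles", scheme=None))
```

If a different scheme is preferred, only the mapping would need to change.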
If these datasets are outside the scope of OA, please let me know.
Thanks.