Open-Assistant dataset: BigScience Biomedical Datasets


Open casey-martin opened this issue 1 year ago • 0 comments

I was looking into adding some datasets from the BigBio repository, and I had some questions before proceeding.

1. Galactica was trained on a subset of the BigBio corpus. I'm not sure which model the ML team has decided on, but if it is Galactica, should I omit the datasets it has already seen?
2. Galactica uses special tags for amino acid, DNA, and SMILES sequences. Should I also tag these entities, or would that get in the way if OA decides to go with a different representation scheme?
3. BigBio is a collection of individual datasets. Should I keep these datasets as separate submissions to OA, or should I bundle them all together?
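
To make the tagging question concrete, here is a minimal sketch of what I have in mind. The tag strings follow the style described in the Galactica paper as I understand it, and the names `GALACTICA_STYLE_TAGS` and `tag_sequence` are hypothetical, not part of BigBio or OA. Keeping the tags in a lookup table means they could be swapped or disabled if OA settles on a different scheme.

```python
# Hypothetical tag table. The strings are an assumption based on the
# Galactica paper and should be checked against the model's tokenizer.
GALACTICA_STYLE_TAGS = {
    "amino": ("[START_AMINO]", "[END_AMINO]"),
    "dna": ("[START_DNA]", "[END_DNA]"),
    "smiles": ("[START_SMILES]", "[END_SMILES]"),
}


def tag_sequence(sequence, kind, tags=GALACTICA_STYLE_TAGS):
    """Wrap a sequence in start/end tags for its kind.

    Passing tags=None leaves the sequence untagged, which would suit a
    model that does not use special sequence tokens.
    """
    if tags is None:
        return sequence
    start, end = tags[kind]
    return f"{start}{sequence}{end}"


# Ethanol as a SMILES string, tagged and untagged.
print(tag_sequence("CCO", "smiles"))
print(tag_sequence("CCO", "smiles", tags=None))
```

If the tagging lives in one function like this, switching schemes later would only mean changing the table rather than reprocessing each dataset by hand.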

If these datasets are outside the scope of OA, please let me know.

Thanks.

Mar 03 '23 03:03 casey-martin