Add UA-SQuAD dataset for Ukrainian language

Open nmeln opened this issue 2 years ago • 2 comments

This is a Ukrainian version of the Stanford Question Answering Dataset (SQuAD), a QA dataset released under the MIT license. Although the page says it is a WIP, the dataset already contains a good amount of data. I will try reaching out to the authors to find out whether they have a more complete version.

GitHub link: https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/question_answering

The repo also contains Text Classification and Token Classification datasets, but I'm not sure whether they are useful for OA: https://github.com/fido-ai/ua-datasets

Info about dataset:

Number of samples: 13 859
Number of questions without answers: 2 927
File size: 17.1 MB

Link to Hugging Face dataset: https://huggingface.co/datasets/FIdo-AI/ua-squad
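
For reference, here is a rough sketch of how the records could be flattened into question/answer pairs while keeping track of the questions without answers. It assumes the file follows the original SQuAD JSON layout (`data` → `paragraphs` → `qas` → `answers`), which I have not verified for UA-SQuAD; the function name `iter_qa_pairs` and the sample record are made up for illustration.

```python
from typing import Iterator, Optional, Tuple


def iter_qa_pairs(squad_json: dict) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (context, question, answer) triples from a SQuAD-style dict.

    The answer is None when a question has no answers.
    Assumes the original SQuAD nesting; UA-SQuAD may differ.
    """
    for article in squad_json.get("data", []):
        for paragraph in article.get("paragraphs", []):
            context = paragraph["context"]
            for qa in paragraph.get("qas", []):
                answers = qa.get("answers") or []
                # Take the first answer text if present, else mark as unanswered.
                answer = answers[0]["text"] if answers else None
                yield context, qa["question"], answer


# Hypothetical sample record, not taken from the real dataset.
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Київ є столицею України.",
                    "qas": [
                        {"question": "Яке місто є столицею України?",
                         "answers": [{"text": "Київ"}]},
                        {"question": "Скільки людей живе у Львові?",
                         "answers": []},
                    ],
                }
            ]
        }
    ]
}

pairs = list(iter_qa_pairs(sample))
unanswered = sum(1 for _, _, answer in pairs if answer is None)
print(len(pairs), unanswered)
```

Keeping unanswered questions as `None` instead of dropping them leaves it up to the training code to decide whether to filter them out or turn them into "no answer" examples.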

nmeln avatar Feb 12 '23 22:02 nmeln

Thank you.

huu4ontocord avatar Feb 13 '23 17:02 huu4ontocord

@ontocord

Should I proceed with adding it here? https://github.com/LAION-AI/Open-Assistant/blob/main/model/model_training/custom_datasets/qa_datasets.py

nmeln avatar Feb 13 '23 23:02 nmeln