nusa-crowd
Closes #227 Data loader for Karonese sentiment
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `nusantara/nusa_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_NUSANTARA_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `NusantaraConfig` for the source schema and one for a nusantara schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
This dataset is a bit noisy at the moment: aside from inconsistent labeling (numeric vs. string), some rows have no labels at all. I've sent a PR to that dataset, https://github.com/imkarokaro123/karonese/pull/1, which cleans the data and also adds username masking for some privacy.
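For reference, the kind of cleanup described above could look roughly like the sketch below. This is not the code from the upstream PR: the numeric-to-string label mapping, the `@USER` placeholder, and the `(text, label)` row layout are all assumptions made for illustration.

```python
import re

# Hypothetical mapping from numeric codes to string labels; the actual
# label values used by the dataset are not stated in this thread.
LABEL_MAP = {"0": "negative", "1": "neutral", "2": "positive"}
VALID_LABELS = set(LABEL_MAP.values())

# Matches "@" followed by word characters (assumed username format).
USERNAME_RE = re.compile(r"@\w+")


def normalize_label(raw):
    """Return a canonical string label, or None if missing or unrecognized."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if not text:
        return None
    if text in LABEL_MAP:
        return LABEL_MAP[text]
    return text if text in VALID_LABELS else None


def mask_usernames(text):
    """Replace every @mention with a fixed placeholder."""
    return USERNAME_RE.sub("@USER", text)


def clean_rows(rows):
    """Drop rows without a usable label and mask usernames in the rest."""
    cleaned = []
    for text, raw_label in rows:
        label = normalize_label(raw_label)
        if label is None:
            continue  # unlabeled or unrecognized rows are skipped
        cleaned.append((mask_usernames(text), label))
    return cleaned
```

Rows with unrecognized labels are dropped rather than guessed at, so the dataloader only ever sees one consistent label vocabulary.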
@afaji : So, let's just use the one from your fork and move forward with the PR, shall we?
Waiting for this PR to be approved https://github.com/imkarokaro123/karonese/pull/3
Hi @aliakbars : Perhaps we can just use the data from your fork for now, since we haven't had any update from the author of the dataset.
Updated. Should be working properly now, @SamuelCahyawijaya @holylovenia @muhsatrio.
/test dataset=karonese_sentiment