allennlp
Add HuggingfaceDatasetReader for using Huggingface datasets
Added a new reader to allow reading Huggingface datasets as instances
Mapped a limited set of datasets.features to allennlp.data.fields
Verified against selected datasets and/or dataset configurations for the training split, as listed in the documentation comments of the reader.
datasets==1.5.0
Joint work with @divijbajaj @annajung @prajaktakini-vmware @agururajvais
Signed-off-by: Abhishek P (VMware) [email protected]
Fixes #4962
Changes proposed in this pull request: introduce a new reader that wraps Huggingface datasets to provide instances for a split of a dataset, with an optional configuration.
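As a sketch of what this enables, a reader like this might be used as follows; the registered name and constructor arguments here are illustrative assumptions, not necessarily the final API of this PR:

```python
from allennlp.data import DatasetReader

# "huggingface-datasets" is an assumed registration name for this sketch.
reader = DatasetReader.by_name("huggingface-datasets")(
    dataset_name="glue", config_name="cola"
)

# Reading a split yields allennlp Instances built from the HF examples.
for instance in reader.read("train"):
    print(instance)
    break
```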
Looking at the lint and format issues. Will address them hopefully today.
Locally, the new tests pass -
tests/data/dataset_readers/huggingface_datasets_test.py ........ . [ 25%]
But since they do downloads it is quite possible they fail. Will try to see if I can pinpoint it. But should be okay for the draft.
It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?
Thanks for this PR, @pab-vmware, @divijbajaj, @annajung, @pkini-vmware and @agururajvais.
And thanks @dirkgr for your thorough review.
Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. :hugs:
> It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?
Nice idea, we can formulate a custom dataset and test around it to validate the mapping of datasets.features to allennlp.data.fields.
In the meantime I noticed glue is small, so I will change the UTs to use only that.
We can probably have sanity scripts that test against specific important datasets to validate support for them, independent of the UTs.
Let me make one such dataset as part of this. I will ask where to place it (my repo or somewhere else) once I have it.
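For reference, a tiny in-memory dataset can be built with public `datasets` APIs (`Dataset.from_dict` with explicit `Features`); the column names and labels below are made up for illustration:

```python
from datasets import ClassLabel, Dataset, Features, Value

# Explicit features let the reader's feature -> field mapping be exercised.
features = Features(
    {
        "sentence": Value("string"),
        "label": ClassLabel(names=["unacceptable", "acceptable"]),
    }
)
tiny = Dataset.from_dict(
    {"sentence": ["A tiny example.", "Another one."], "label": [1, 0]},
    features=features,
)
print(tiny.features["label"])
```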
Looks like I need to do more changes to get it working even for glue. Will inform once it is done.
Working for 3 datasets, with a couple of different types of features.
Type-check failures still need to be addressed.
Samples from 2 of the datasets; xnli is a little too big for a comment.
glue-cola:
From: {'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To:
Instance with fields:
sentence: TextField of length 1 with text:
[Our friends won't buy this analysis, let alone the next one we propose.]
label: LabelField with label: 1 in namespace: 'label'.
idx: LabelField with label: 0 in namespace: 'idx'.
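A minimal sketch of the kind of mapping behind the glue-cola example above (the function name is hypothetical, and this is not necessarily the PR's exact logic): `Value("string")` becomes a `TextField`, and `ClassLabel` or integer `Value` columns become `LabelField`s.

```python
from datasets import ClassLabel, Value
from allennlp.data.fields import LabelField, TextField
from allennlp.data.tokenizers import Token

def map_feature(name, feature, value):
    if isinstance(feature, ClassLabel):
        # Integer class index, stored as-is under its own namespace.
        return LabelField(value, label_namespace=name, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        # The whole string as a single token, matching the length-1 TextField above.
        return TextField([Token(value)])
    if isinstance(feature, Value):
        # e.g. the integer "idx" column.
        return LabelField(int(value), label_namespace=name, skip_indexing=True)
    raise ValueError(f"unsupported feature: {feature}")
```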
universal_dependencies - en_lines:
From:
{'deprel': ['mark', 'det', 'nsubj', 'advcl', 'case', 'det', 'compound', 'flat', 'obl', 'case', 'det', 'compound', 'flat', 'nmod', 'punct', 'det', 'nsubj:pass', 'aux:pass', 'root', 'case', 'det', 'compound', 'flat', 'compound', 'obl', 'punct'], 'deps': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'], 'feats': ["{'Case': 'Nom'}", "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", "{'Tense': 'Past', 'VerbForm': 'Part', 'Voice': 'Pass'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Case': 'Nom'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None'], 'head': ['4', '3', '4', '19', '9', '9', '9', '7', '4', '14', '14', '14', '12', '9', '4', '17', '19', '19', '0', '25', '25', '25', '22', '25', '19', '19'], 'idx': 'en_lines-ud-dev-doc1-3177', 'lemmas': ['when', 'a', 'user', 'connect', 'to', 'the', 'SQL', 'server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'be', 'enable', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'misc': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None'], 'text': 'When a user connects to the SQL Server database through a Microsoft Access project, the connection is enabled through a Windows NT user account.', 'tokens': ['When', 'a', 'user', 'connects', 'to', 'the', 'SQL', 'Server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'is', 'enabled', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'upos': [5, 8, 0, 16, 2, 8, 10, 0, 0, 2, 8, 10, 10, 0, 1, 8, 0, 17, 16, 2, 8, 10, 10, 0, 0, 1], 'xpos': [None, 'IND-SG', 'SG-NOM', 'PRES', None, 'DEF', 'SG-NOM', 'SG-NOM', 'SG-NOM', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Comma', 'DEF', 'SG-NOM', 'PRES-AUX', 'PASS', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Period']}
To:
{'idx': <allennlp.data.fields.text_field.TextField object at 0x147f4b0c0>, 'text': <allennlp.data.fields.text_field.TextField object at 0x147c7a740>, 'tokens': <allennlp.data.fields.list_field.ListField object at 0x145b8c040>, 'lemmas': <allennlp.data.fields.list_field.ListField object at 0x145b8c820>, 'upos': <allennlp.data.fields.list_field.ListField object at 0x145b8c1f0>, 'xpos': <allennlp.data.fields.list_field.ListField object at 0x145b8cd60>, 'feats': <allennlp.data.fields.list_field.ListField object at 0x147ed3a00>, 'head': <allennlp.data.fields.list_field.ListField object at 0x147ed30d0>, 'deprel': <allennlp.data.fields.list_field.ListField object at 0x147ed39d0>, 'deps': <allennlp.data.fields.list_field.ListField object at 0x147ed3a30>, 'misc': <allennlp.data.fields.list_field.ListField object at 0x147ed3af0>}
Have a working implementation with a limited features -> fields mapping. Let me work on getting the code coverage up for the diff, even if I have to use a slightly heavier dataset at this point. I will also look at adding a custom HF dataset so we can use it for testing purposes.
I will try to increase the number of features that we are mapping, but is the current change (post coverage) sufficient to go in?
> I will also look at adding a custom HF dataset so we can use it for testing purposes.
That would be great!
What's the status of this? Is it close? Anything you need? I'd love to have this in, maybe even for the 2.3 release!
Hi Dirk, now that the coverage is up to 90%, this MR can IMO go in, even though the implementation is still slightly raw. Let me re-check the known supported datasets/configs and update the list.
There are things that still need to be done; for most of them I have added TODOs, and I will pick them up one by one in the coming weeks.
- Clean up the code so we have a mapper class/method for each of the feature types (see the sketch after this list)
- Support nested features (linked to the above)
- Add a custom lightweight dataset and configs and use them in the UTs
- Add a test where the new reader is compared to the existing conll2003 reader - this would validate the reader in a practical manner
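For the first two items, the dispatch could look roughly like this recursive, mapper-based sketch (names are hypothetical; the final code may organize mappers as methods on the reader instead):

```python
from datasets import ClassLabel, Sequence, Value
from allennlp.data.fields import Field, LabelField, ListField, TextField
from allennlp.data.tokenizers import Token

def map_to_field(name: str, feature, value) -> Field:
    if isinstance(feature, Sequence):
        # Recurse into the inner feature so nested features fall out naturally.
        return ListField([map_to_field(name, feature.feature, v) for v in value])
    if isinstance(feature, ClassLabel):
        return LabelField(value, label_namespace=name, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        return TextField([Token(value)])
    raise ValueError(f"no mapper for feature type {type(feature)}")
```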
So I have updated the datasets list and added validation tests that are skipped for now. Squashed all changes into a single commit. Please take a look when time permits.
I will work on cleaning up the code and providing nested-feature support. If the dataset downloads during tests become problematic, I will prioritize creating the custom dataset.
> Squashed all changes into a single commit.
Don't worry about doing that for our sake. It'll get squashed when we merge anyways. In fact, now I have no way of seeing just the changes you made since the last time I reviewed this.
What's the status of this? It looked like it was pretty close to done?
I was not able to make time for this over the last couple of weeks. I plan to resume this weekend with a focus on the recursive, mapper-based implementation.
I think the docstring of line 135 of trainer.py (the model parameter) also needs to be updated
> If you are training your model using GPUs, your model should already be on the correct device. (If you are using our `train` command this will be handled for you.)

The part in parentheses implies that the model doesn't have to be on the GPU if you're using the `train` command. I think it should be the same as your change above, i.e.

> If you are training your model using GPUs, your model should already be on the correct device, which you can do with `model = model.cuda()`.

Perhaps add a link instead of repeating? i.e.

> See `cuda_device` for more details.
Hi David, I'm not sure which of my changes you are referring to; can you please point me to the file and line number?
I have to re-add some of the inline comments I removed during the refactoring. I have added the mapping functions as independent functions. Let me know if you want me to move them into the reader class as methods.
I don't know what's happening here, but GitHub thinks this PR has 8000 lines? It's not reviewable like this. I'll try to find out how to work around that.
To make a clean PR that can be tested and reviewed, I flattened this whole thing into a new PR: #5194
Unfortunately we'll lose the review and comment history this way, but we gain the ability to understand what's going on :-)
> I don't know what's happening here, but GitHub thinks this PR has 8000 lines? It's not reviewable like this. I'll try to find out how to work around that.
Seeing that now. Let me try and fix that somehow.
Looks like I messed up my rebase/merge last night. Let me try and fix that.
Ok. If that fails, we can switch to the other PR that I just made.
On the one hand, I don't think it's your fault. I rebased the same way, and I also got a mess in the GitHub UI.
On the other hand, just don't rebase (in the future). Just merge origin/main into here. It's easier, and it doesn't confuse GitHub.
Done. Finally, 6 years of git torment came in handy 😄.
How did you do it?
> How did you do it?
Had to drop the pestering commit and then force-push the branch. I seem to have made some mistake beforehand which caused this problem.
Local workflow checks are pending before I request the workflow run here.
Looks like the refactor did cause a regression: the tests I had skipped since they download datasets are now failing for some datasets. Working on fixing them; I will keep them unskipped until we are actually ready to merge.
> Looks like the refactor did cause a regression: the tests I had skipped since they download datasets are now failing for some datasets. Working on fixing them; I will keep them unskipped until we are actually ready to merge.
Fixed now.
@dirkgr I am thinking of picking a few datasets and adding clear structure-based tests to make sure this implementation is valid to the desired extent.
The first one I am picking is conll2003. Please recommend other datasets so that I have clearer goals/checkpoints to work towards.
Testing coverage against ~90% of the datasets is not easy, since I would need to figure out the config for most of the datasets (800+ of them) in addition to having to download them.
If that is strictly required, I will try to write a script to test this for each of the datasets, probably running it on an AWS VM I have so as to avoid the network problems.
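A rough sketch of such a script, assuming a `datasets` version that exposes `list_datasets` and `get_dataset_config_names` (how each dataset would actually be validated is left as a placeholder):

```python
from datasets import get_dataset_config_names, list_datasets, load_dataset

for name in list_datasets():
    for config in get_dataset_config_names(name):
        try:
            # Only materialize a tiny slice of the train split.
            sample = load_dataset(name, config, split="train[:5]")
            print(f"{name}/{config}: ok ({sample.num_rows} rows)")
        except Exception as err:
            print(f"{name}/{config}: FAILED ({err})")
```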
For conll2003 - https://huggingface.co/datasets/conll2003 - on the HF side it is as follows:
{
"chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
"id": "0",
"ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
"tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]
}
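The integer tags come straight from the dataset's features, which can be inspected directly (these are real `datasets` APIs; output abbreviated):

```python
from datasets import load_dataset

dataset = load_dataset("conll2003", split="train")

# Each tag column is a Sequence of ClassLabel, so the ints decode to names.
print(dataset.features["ner_tags"])
print(dataset.features["ner_tags"].feature.int2str(3))  # e.g. 'B-ORG'
```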
Currently the reader converts it as follows:
id: TextField of length 1 with text:
[0]
tokens: ListField of 12 TextFields :
TextField of length 1 with text:
[SOCCER]
TextField of length 1 with text:
[-]
TextField of length 1 with text:
[JAPAN]
TextField of length 1 with text:
[GET]
TextField of length 1 with text:
[LUCKY]
TextField of length 1 with text:
[WIN]
TextField of length 1 with text:
[,]
TextField of length 1 with text:
[CHINA]
TextField of length 1 with text:
[IN]
TextField of length 1 with text:
[SURPRISE]
TextField of length 1 with text:
[DEFEAT]
TextField of length 1 with text:
[.]
pos_tags: ListField of 12 LabelFields :
LabelField with label: 21 in namespace: 'pos_tags'.
LabelField with label: 8 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 37 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 6 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 15 in namespace: 'pos_tags'.
LabelField with label: 12 in namespace: 'pos_tags'.
LabelField with label: 21 in namespace: 'pos_tags'.
LabelField with label: 7 in namespace: 'pos_tags'.
chunk_tags: ListField of 12 LabelFields :
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 21 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 12 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 13 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 12 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
ner_tags: ListField of 12 LabelFields :
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 5 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 1 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
I see that tokens could be a single TextField of tokens instead of a ListField of TextFields. But as we had discussed before, we may not be able to distinguish whether a Sequence of strings is a sequence of tokens or a sequence of texts.
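For the structure-based conll2003 test discussed above, one possible shape is the following (the reader class name and its arguments are hypothetical assumptions):

```python
from allennlp.data.fields import ListField

def test_conll2003_structure():
    # HuggingfaceDatasetReader and dataset_name are assumed names for this sketch.
    reader = HuggingfaceDatasetReader(dataset_name="conll2003")
    instance = next(iter(reader.read("train")))

    # Every tag sequence must line up one-to-one with the tokens.
    n_tokens = instance.fields["tokens"].sequence_length()
    for name in ("pos_tags", "chunk_tags", "ner_tags"):
        field = instance.fields[name]
        assert isinstance(field, ListField)
        assert field.sequence_length() == n_tokens
```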