allennlp
Add HuggingfaceDatasetReader for using Huggingface datasets
Added a new reader to allow reading Huggingface datasets as instances
Mapped a limited set of datasets.features to allennlp.data.fields
Verified against selected datasets and/or dataset configurations for the training split, as listed in the documentation comments of the reader.
datasets==1.5.0
Joint work with @divijbajaj @annajung @prajaktakini-vmware @agururajvais
Signed-off-by: Abhishek P (VMware) [email protected]
Fixes #4962
Changes proposed in this pull request: introduce a new reader that wraps Huggingface datasets to provide instances for a split of a dataset, with an optional configuration.
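As a sketch of what this enables, a reader like this might be used as follows; the registered name and constructor arguments here are illustrative assumptions, not necessarily the final API of this PR:

```python
from allennlp.data import DatasetReader

# "huggingface-datasets" is an assumed registration name for this sketch.
reader = DatasetReader.by_name("huggingface-datasets")(
    dataset_name="glue", config_name="cola"
)

# Reading a split yields allennlp Instances built from the HF examples.
for instance in reader.read("train"):
    print(instance)
    break
```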
Looking at the lint and format issues. Will address them hopefully today.
Locally, the new tests pass -
tests/data/dataset_readers/huggingface_datasets_test.py ........ . [ 25%]
But since they do downloads it is quite possible they fail. Will try to see if I can pinpoint it. But should be okay for the draft.
It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?
Thanks for this PR, @pab-vmware, @divijbajaj, @annajung, @pkini-vmware and @agururajvais.
And thanks @dirkgr for your thorough review.
Just let me know if there is anything we can do on Hugging Face Datasets to help you with the integration. :hugs:
> It's OK for tests to do downloads, but it's better if the downloads are small. Is there a tiny dataset we can use? Maybe we upload one just for this purpose?
Nice idea, we can formulate a custom dataset and test around it to validate the mapping of datasets.features to allennlp.data.fields.
In the meantime I noticed glue is small, so I will change the UTs to use only that.
We can probably have sanity scripts that test against specific important datasets to validate support for them, independent of the UTs.
Let me make one such dataset as part of this. I will ask where to place it (my repo or somewhere else) once I have it.
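For reference, a tiny in-memory dataset can be built with public `datasets` APIs (`Dataset.from_dict` with explicit `Features`); the column names and labels below are made up for illustration:

```python
from datasets import ClassLabel, Dataset, Features, Value

# Explicit features let the reader's feature -> field mapping be exercised.
features = Features(
    {
        "sentence": Value("string"),
        "label": ClassLabel(names=["unacceptable", "acceptable"]),
    }
)
tiny = Dataset.from_dict(
    {"sentence": ["A tiny example.", "Another one."], "label": [1, 0]},
    features=features,
)
print(tiny.features["label"])
```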
Looks like I need to do more changes to get it working even for glue. Will inform once it is done.
Working for 3 datasets, with a couple of different types of features.
Type-check failures still need to be addressed.
Samples from 2 of the datasets; xnli is a little too big for a comment.
glue-cola:
From: {'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To:
Instance with fields:
sentence: TextField of length 1 with text:
[Our friends won't buy this analysis, let alone the next one we propose.]
label: LabelField with label: 1 in namespace: 'label'.
idx: LabelField with label: 0 in namespace: 'idx'.
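A minimal sketch of the kind of mapping behind the glue-cola example above (the function name is hypothetical, and this is not necessarily the PR's exact logic): `Value("string")` becomes a `TextField`, and `ClassLabel` or integer `Value` columns become `LabelField`s.

```python
from datasets import ClassLabel, Value
from allennlp.data.fields import LabelField, TextField
from allennlp.data.tokenizers import Token

def map_feature(name, feature, value):
    if isinstance(feature, ClassLabel):
        # Integer class index, stored as-is under its own namespace.
        return LabelField(value, label_namespace=name, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        # The whole string as a single token, matching the length-1 TextField above.
        return TextField([Token(value)])
    if isinstance(feature, Value):
        # e.g. the integer "idx" column.
        return LabelField(int(value), label_namespace=name, skip_indexing=True)
    raise ValueError(f"unsupported feature: {feature}")
```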
universal_dependencies - en_lines:
From:
{'deprel': ['mark', 'det', 'nsubj', 'advcl', 'case', 'det', 'compound', 'flat', 'obl', 'case', 'det', 'compound', 'flat', 'nmod', 'punct', 'det', 'nsubj:pass', 'aux:pass', 'root', 'case', 'det', 'compound', 'flat', 'compound', 'obl', 'punct'], 'deps': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'], 'feats': ["{'Case': 'Nom'}", "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'PronType': 'Art'}", "{'Number': 'Sing'}", "{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}", "{'Tense': 'Past', 'VerbForm': 'Part', 'Voice': 'Pass'}", 'None', "{'Definite': 'Ind', 'PronType': 'Art'}", "{'Case': 'Nom'}", "{'Case': 'Nom'}", "{'Number': 'Sing'}", "{'Number': 'Sing'}", 'None'], 'head': ['4', '3', '4', '19', '9', '9', '9', '7', '4', '14', '14', '14', '12', '9', '4', '17', '19', '19', '0', '25', '25', '25', '22', '25', '19', '19'], 'idx': 'en_lines-ud-dev-doc1-3177', 'lemmas': ['when', 'a', 'user', 'connect', 'to', 'the', 'SQL', 'server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'be', 'enable', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'misc': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None'], 'text': 'When a user connects to the SQL Server database through a Microsoft Access project, the connection is enabled through a Windows NT user account.', 'tokens': ['When', 'a', 'user', 'connects', 'to', 'the', 'SQL', 'Server', 'database', 'through', 'a', 'Microsoft', 'Access', 'project', ',', 'the', 'connection', 'is', 'enabled', 'through', 'a', 'Windows', 'NT', 'user', 'account', '.'], 'upos': [5, 8, 0, 16, 2, 8, 10, 0, 0, 2, 8, 10, 10, 0, 1, 8, 0, 17, 16, 2, 8, 10, 10, 0, 0, 1], 'xpos': [None, 'IND-SG', 'SG-NOM', 'PRES', None, 'DEF', 'SG-NOM', 'SG-NOM', 'SG-NOM', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Comma', 'DEF', 'SG-NOM', 'PRES-AUX', 'PASS', None, 'IND-SG', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'SG-NOM', 'Period']}
To:
{'idx': <allennlp.data.fields.text_field.TextField object at 0x147f4b0c0>, 'text': <allennlp.data.fields.text_field.TextField object at 0x147c7a740>, 'tokens': <allennlp.data.fields.list_field.ListField object at 0x145b8c040>, 'lemmas': <allennlp.data.fields.list_field.ListField object at 0x145b8c820>, 'upos': <allennlp.data.fields.list_field.ListField object at 0x145b8c1f0>, 'xpos': <allennlp.data.fields.list_field.ListField object at 0x145b8cd60>, 'feats': <allennlp.data.fields.list_field.ListField object at 0x147ed3a00>, 'head': <allennlp.data.fields.list_field.ListField object at 0x147ed30d0>, 'deprel': <allennlp.data.fields.list_field.ListField object at 0x147ed39d0>, 'deps': <allennlp.data.fields.list_field.ListField object at 0x147ed3a30>, 'misc': <allennlp.data.fields.list_field.ListField object at 0x147ed3af0>}
Have a working implementation with a limited features -> fields mapping. Let me work on getting the code coverage up for the diff, even if I have to use a slightly heavier dataset at this point. I will also look at adding a custom HF dataset so we can use it for testing purposes.
I will try to increase the number of features that we are mapping, but is the current change (post coverage) sufficient to go in?
> I will also look at adding a custom HF dataset so we can use it for testing purposes.
That would be great!
What's the status of this? Is it close? Anything you need? I'd love to have this in, maybe even for the 2.3 release!
Hi Dirk, now that the coverage is up to 90%, this MR can IMO go in, even though the implementation is still slightly raw. Let me re-check the known supported datasets/configs and update the list.
There are things that still need to be done; for most of them I have added TODOs, and I will pick them up one by one in the coming weeks.
- Clean up the code so we have a mapper class/method for each of the feature types (see the sketch after this list)
- Support nested features (linked to the above)
- Add a custom lightweight dataset and configs and use them in the UTs
- Add a test where the new reader is compared to the existing conll2003 reader - this would validate the reader in a practical manner
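For the first two items, the dispatch could look roughly like this recursive, mapper-based sketch (names are hypothetical; the final code may organize mappers as methods on the reader instead):

```python
from datasets import ClassLabel, Sequence, Value
from allennlp.data.fields import Field, LabelField, ListField, TextField
from allennlp.data.tokenizers import Token

def map_to_field(name: str, feature, value) -> Field:
    if isinstance(feature, Sequence):
        # Recurse into the inner feature so nested features fall out naturally.
        return ListField([map_to_field(name, feature.feature, v) for v in value])
    if isinstance(feature, ClassLabel):
        return LabelField(value, label_namespace=name, skip_indexing=True)
    if isinstance(feature, Value) and feature.dtype == "string":
        return TextField([Token(value)])
    raise ValueError(f"no mapper for feature type {type(feature)}")
```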
So I have updated the datasets list and added validation tests that are skipped for now. Squashed all changes into a single commit. Please take a look when time permits.
I will work on cleaning up the code and providing nested-feature support. If the dataset downloads during tests become problematic, I will prioritize creating the custom dataset.
> Squashed all changes into a single commit.
Don't worry about doing that for our sake. It'll get squashed when we merge anyways. In fact, now I have no way of seeing just the changes you made since the last time I reviewed this.
What's the status of this? It looked like it was pretty close to done?
I was not able to make time for this over the last couple of weeks. I plan to resume this weekend with a focus on the recursive, mapper-based implementation.
I think the docstring of line 135 of trainer.py (the model parameter) also needs to be updated
> If you are training your model using GPUs, your model should already be on the correct device. (If you are using our `train` command this will be handled for you.)

The part in parentheses implies that the model doesn't have to be on the GPU if you're using the `train` command. I think it should be the same as your change above, i.e.

> If you are training your model using GPUs, your model should already be on the correct device, which you can do with `model = model.cuda()`.

Perhaps add a link instead of repeating? i.e.

> See `cuda_device` for more details.
Hi David, I'm not sure which of my changes you are referring to; can you please point me to the file and line number?
I have to re-add some of the inline comments I removed during the refactoring. I have added the mapping functions as independent functions. Let me know if you want me to move them into the reader class as methods.
I don't know what's happening here, but GitHub thinks this PR has 8000 lines? It's not reviewable like this. I'll try to find out how to work around that.
To make a clean PR that can be tested and reviewed, I flattened this whole thing into a new PR: #5194
Unfortunately we'll lose the review and comment history this way, but we gain the ability to understand what's going on :-)
> I don't know what's happening here, but GitHub thinks this PR has 8000 lines? It's not reviewable like this. I'll try to find out how to work around that.
Seeing that now. Let me try and fix that somehow.
Looks like I messed up my rebase/merge last night. Let me try and fix that.
Ok. If that fails, we can switch to the other PR that I just made.
On the one hand, I don't think it's your fault. I rebased the same way, and I also got a mess in the GitHub UI.
On the other hand, just don't rebase (in the future). Just merge origin/main into here. It's easier, and it doesn't confuse GitHub.
Done. Finally, 6 years of git torment came in handy 😄.
How did you do it?
> How did you do it?
Had to drop the pestering commit and then force-push the branch. I seem to have made some mistake beforehand which caused this problem.
Local workflow checks are pending before I request the workflow run here.
Looks like the refactor did cause a regression: the tests I had skipped since they download datasets are now failing for some datasets. Working on fixing them; I will keep them unskipped until we are actually ready to merge.
> Looks like the refactor did cause a regression: the tests I had skipped since they download datasets are now failing for some datasets. Working on fixing them; I will keep them unskipped until we are actually ready to merge.
Fixed now.
@dirkgr I am thinking of picking a few datasets and adding clear structure-based tests to make sure this implementation is valid to the desired extent.
The first one I am picking is conll2003. Please recommend other datasets so that I have clearer goals/checkpoints to work towards.
Testing coverage against ~90% of the datasets is not easy, since I would need to figure out the config for most of the datasets (800+ of them) in addition to having to download them.
If that is strictly required, I will try to write a script to test this for each of the datasets, probably running it on an AWS VM I have so as to avoid the network problems.
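A rough sketch of such a script, assuming a `datasets` version that exposes `list_datasets` and `get_dataset_config_names` (how each dataset would actually be validated is left as a placeholder):

```python
from datasets import get_dataset_config_names, list_datasets, load_dataset

for name in list_datasets():
    for config in get_dataset_config_names(name):
        try:
            # Only materialize a tiny slice of the train split.
            sample = load_dataset(name, config, split="train[:5]")
            print(f"{name}/{config}: ok ({sample.num_rows} rows)")
        except Exception as err:
            print(f"{name}/{config}: FAILED ({err})")
```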
For conll2003 - https://huggingface.co/datasets/conll2003 - on the HF side it is as follows:
{
"chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
"id": "0",
"ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
"tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]
}
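The integer tags come straight from the dataset's features, which can be inspected directly (these are real `datasets` APIs; output abbreviated):

```python
from datasets import load_dataset

dataset = load_dataset("conll2003", split="train")

# Each tag column is a Sequence of ClassLabel, so the ints decode to names.
print(dataset.features["ner_tags"])
print(dataset.features["ner_tags"].feature.int2str(3))  # e.g. 'B-ORG'
```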
Currently the reader converts it as follows:
id: TextField of length 1 with text:
[0]
tokens: ListField of 12 TextFields :
TextField of length 1 with text:
[SOCCER]
TextField of length 1 with text:
[-]
TextField of length 1 with text:
[JAPAN]
TextField of length 1 with text:
[GET]
TextField of length 1 with text:
[LUCKY]
TextField of length 1 with text:
[WIN]
TextField of length 1 with text:
[,]
TextField of length 1 with text:
[CHINA]
TextField of length 1 with text:
[IN]
TextField of length 1 with text:
[SURPRISE]
TextField of length 1 with text:
[DEFEAT]
TextField of length 1 with text:
[.]
pos_tags: ListField of 12 LabelFields :
LabelField with label: 21 in namespace: 'pos_tags'.
LabelField with label: 8 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 37 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 6 in namespace: 'pos_tags'.
LabelField with label: 22 in namespace: 'pos_tags'.
LabelField with label: 15 in namespace: 'pos_tags'.
LabelField with label: 12 in namespace: 'pos_tags'.
LabelField with label: 21 in namespace: 'pos_tags'.
LabelField with label: 7 in namespace: 'pos_tags'.
chunk_tags: ListField of 12 LabelFields :
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 21 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 12 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 13 in namespace: 'chunk_tags'.
LabelField with label: 11 in namespace: 'chunk_tags'.
LabelField with label: 12 in namespace: 'chunk_tags'.
LabelField with label: 0 in namespace: 'chunk_tags'.
ner_tags: ListField of 12 LabelFields :
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 5 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 1 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
LabelField with label: 0 in namespace: 'ner_tags'.
I see that tokens could be a single TextField of tokens instead of a ListField of TextFields. But as we had discussed before, we may not be able to distinguish whether a Sequence of strings is a sequence of tokens or a sequence of texts.
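For the structure-based conll2003 test discussed above, one possible shape is the following (the reader class name and its arguments are hypothetical assumptions):

```python
from allennlp.data.fields import ListField

def test_conll2003_structure():
    # HuggingfaceDatasetReader and dataset_name are assumed names for this sketch.
    reader = HuggingfaceDatasetReader(dataset_name="conll2003")
    instance = next(iter(reader.read("train")))

    # Every tag sequence must line up one-to-one with the tokens.
    n_tokens = instance.fields["tokens"].sequence_length()
    for name in ("pos_tags", "chunk_tags", "ner_tags"):
        field = instance.fields[name]
        assert isinstance(field, ListField)
        assert field.sequence_length() == n_tokens
```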