inception-external-recommender

Create HuggingFaceTransformer.py

Open mobashgr opened this issue 2 years ago • 3 comments

Here is the code for adding any HuggingFace Transformer model over INCEpTION

mobashgr avatar Feb 07 '22 15:02 mobashgr

Thank you for the PR! The basics look good to me. The code still has one issue: it does not use INCEpTION's tokenization, which forces you to use character-level granularity, and that is not so nice later when exporting the corpus and using it downstream. In other recommenders, we align the predictions to the INCEpTION tokenization, which you would need to do here as well before I would merge it tbh. Examples and hints can be found in:

https://github.com/huggingface/transformers/issues/14305 https://huggingface.co/docs/transformers/custom_datasets?highlight=offset_mapping#token-classification-with-wnut-emerging-entities https://discuss.huggingface.co/t/predicting-with-token-classifier-on-data-with-no-gold-labels/9373
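A minimal sketch of such an alignment, assuming the tokens are available as sorted, non-overlapping (begin, end) character offsets. The function name `snap_to_tokens` and the example offsets are hypothetical and not part of INCEpTION or this PR:

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def snap_to_tokens(span: Span, tokens: List[Span]) -> Optional[Span]:
    """Widen a character span to the boundaries of the tokens it overlaps.

    Assumes `tokens` is sorted by begin offset and non-overlapping.
    Returns None when the span overlaps no token at all.
    """
    begin, end = span
    overlapping = [(b, e) for b, e in tokens if b < end and e > begin]
    if not overlapping:
        return None
    return overlapping[0][0], overlapping[-1][1]


# Hypothetical token offsets for the text "acetylcholine and nicotine"
tokens = [(0, 13), (14, 17), (18, 26)]

# A prediction covering only the word piece "acety" is widened to the full token.
aligned = snap_to_tokens((0, 5), tokens)
```

Widening to the enclosing tokens is only one possible policy; dropping predictions that do not match token boundaries exactly would be a stricter alternative.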

I also do not understand why you would need pandas here; it is certainly possible to do this without it. As you only support token classification here, I would also name it TransformerTokenClassifier or so; the current name suggests that it is a general implementation.

The file name does not fit with the others; in Python, we typically use snake case for files.

It would be nice to have a unit test, even if it just does smoke testing.

jcklie avatar Feb 07 '22 19:02 jcklie

Regarding the first point, yes, I was facing this problem yesterday and used the character-level granularity as suggested by Richard. My problem was resolved, and I don’t think that I have the time to do this alignment now. I just wanted to share my solution to a problem I was facing, especially since the Adapter code isn’t working, which was misleading TBH. I believe that INCEpTION is a very powerful tool and it should definitely have examples for HuggingFace classifiers.

For the second point, I need pandas because the output of the pipeline in my case is a list of lists of dictionaries. A sample of the pipeline output looks like this: [{'entity_group': 'Chemical', 'score': 0.9996301, 'word': 'acety', 'start': 66, 'end': 71}, {'entity_group': 'Chemical', 'score': 0.99999845, 'word': 'nicotine', 'start': 98, 'end': 106}, {'entity_group': 'Chemical', 'score': 0.99911577, 'word': 'la dicine evised', 'start': 122, 'end': 144}, {'entity_group': 'Chemical', 'score': 0.9999038, 'word': 'alpha - only hete', 'start': 308, 'end': 325}]. So, I prefer to convert it into a dataframe.
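For comparison, output of this shape can also be handled with plain Python. The following is only a sketch: the helper name `to_spans` and the `min_score` parameter are hypothetical, and the nesting is assumed to be one inner list per input text.

```python
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def to_spans(batches: Iterable[List[Dict]], min_score: float = 0.0) -> List[Tuple[int, int, str]]:
    """Flatten per-text entity lists into (start, end, label) tuples."""
    return [
        (e["start"], e["end"], e["entity_group"])
        for e in chain.from_iterable(batches)
        if e["score"] >= min_score
    ]


# Two entries copied from the sample output above, wrapped in an outer list.
output = [[
    {"entity_group": "Chemical", "score": 0.9996301, "word": "acety", "start": 66, "end": 71},
    {"entity_group": "Chemical", "score": 0.99999845, "word": "nicotine", "start": 98, "end": 106},
]]

spans = to_spans(output)
```

The resulting tuples carry the character offsets and label, which is the information a character-level suggestion needs.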

mobashgr avatar Feb 07 '22 19:02 mobashgr

@mobashgr Sorry for getting back to you late. Could you please add the same Apache License header to the file that we use in the other files?

I believe it should not be a big problem if the recommender uses a different tokenization. If the recommender creates a suggestion that does not fit the layer settings in INCEpTION, it will be ignored; it should not cause trouble.

reckart avatar Feb 27 '24 10:02 reckart