Named-Entity-Recognition
Named-Entity-Recognition copied to clipboard
Corpus and a baseline neural network system for Named Entity Recognition in Hindi-English Code-Mixed social media text.
trafficstars
Named-Entity-Recognition
We have created a dataset of Hindi-English Code-Mixed Social Media Text (tweets) for the task of Named Entity Recognition. Tweets are pre-processed and annotated as per the 6 NER tags and a 7th Other tag.
NER-Tags
- B-Per Indicates the Begening of a Person's name.
- I-Per Indicates the intermediate of a Person's name.
- B-Org Indicates the Begening of a Organizations's name.
- I-Org Indicates the intermediate of a Organizations's name.
- B-Loc Indicates the Begening of a Locations's name.
- I-Loc Indicates the intermediate of a Locations's name.
- Other Indicates all the word not falling in any of the above 6.
eg:
| #Word | #Tag |
|---|---|
| Bharat | B-Loc |
| ke | Other |
| 2016 | Other |
| ke | Other |
| Demonetization | Other |
| mein | Other |
| kitna | Other |
| kala | Other |
| dhan | Other |
| real | Other |
| mein | Other |
| aaya | Other |
| ??? | Other |
| Accha | Other |
| hua | Other |
| ye | Other |
| prashna | Other |
| Miss | B-Per |
| Word | I-Per |
| Chillar | I-Per |
| ko | Other |
| nahi | Other |
| puccha | Other |
| gaya | Other |
| 0 | Other |
| #misschillar | B-Per |
| #missworld | Other |
| #Demonetisation | Other |
| #notebandi | Other |
| #modi | B-Per |
| #bjp | B-Org |
| #gujrat | B-Loc |
Contents
TwitterDatafolder contains Id's of the scrapped tweets insideScrappedfolder, and processed and annotated data as named inside this.- All the three Models.py are the files for the three ML classification models we used for our reserach paper.
- preprocessing and vector creation scripts are added with names indicating that.
- This dataset is in development and in future we will extend this to more number of tweets so as to make it a more reliable dataset for this taska and others.
Outputs
- DecisionTree and CRF models have direct
scorecalls that gives all the required stats. - Keras does not provide the same for displaying score stats for LSTM model, so we build a coustom call of all the measure values and took average over all the iterations (here 5).
- All the models performed well on the given data.
Decision Treemodel with a f1-score of 0.94.Conditional Random Field (CRF)model with a f1-score of 0.95.LSTMmodel with a f1-score of 0.95.
Authors
- Vinay Singh
- Deepanshu Vijay
- Syed A. Sarfaraz
- Manish Srivastava
LTRC IIIT-Hyderabad
Citation
Named Entity Recognition for Hindi-English Code-Mixed Social Media Text
2018, 27-35, Proceedings of the Seventh Named Entities Workshop here