GeneZC/Chinese-NER: A practice on Chinese Named Entity Recognition

Chinese-NER

An attempt to do better on the Chinese NER task.

Reference & Inspiration

Neural Architectures for Named Entity Recognition
- with pretrained GloVe embedding
- Bi-LSTM (character-level input)
- (optional) LSTM (captures capitalization features)
- concatenate the above two (or three) to form the actual input
- Bi-LSTM (word-level input and character-level input)
- CRF loss layer
Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition
- with pretrained word2vec embedding (character2vec would be a more accurate name)
- Bi-LSTM (radical-level input)
- concatenate above two
- Bi-LSTM (real input)
- CRF loss layer
A neural network model for Chinese named entity recognition
See also: Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
- with pretrained word2vec embedding (same as the one above)
- run word segmentation to yield a second input
- concatenate above two
- Bi-LSTM (real input)
- CRF loss layer
NER-pytorch
- a PyTorch implementation similar to the first reference above.
SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
- a framework that uses GAN + RL to solve sequence problems.
Adversarial Learning for Chinese NER from Crowd Annotations
- AAAI 2018, Alibaba
RL-GAN For NLP: 强化学习在生成对抗网络文本生成中扮演的角色 (The Role of Reinforcement Learning in GAN-Based Text Generation)
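
Most of the architectures listed above end in a CRF layer. As a rough illustration of what that layer does at prediction time, here is a minimal pure-Python sketch of Viterbi decoding over per-character emission scores and a tag-to-tag transition matrix. It is not taken from this repo: the function name `viterbi_decode` and the list-of-lists score layout are assumptions made for the example, and a real model would use the CRF utilities of its deep learning framework.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions:   per-position score lists, shape [seq_len][num_tags]
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    (Both layouts are assumptions made for this sketch.)
    """
    num_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current position
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        next_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            next_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = next_score
        backpointers.append(pointers)
    # Trace back from the best final tag
    last = max(range(num_tags), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example with two tags: switching tags is heavily penalised,
# so the decoder stays on tag 0 even though position 1 prefers tag 1.
toy_emissions = [[1, 0], [0, 1], [1, 0]]
toy_transitions = [[0, -5], [-5, 0]]
best_path, best_score = viterbi_decode(toy_emissions, toy_transitions)
print(best_path, best_score)
```

The toy example shows why a CRF layer helps for tagging: the transition scores let the model reject tag sequences that are locally attractive but globally inconsistent.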

Requirement

TensorFlow
pynlpir
jieba

Usage

Note: The current repo only contains Version 1, which is discussed in Experiments, but you can easily modify the configurations in main.py to reproduce the Baseline described in Experiments.

Preprocess the dataset so that its format matches the examples in /data. The code in preprocess.py can serve as a reference.
Build a character-level embedding with GloVe or fastText. Refer to the README.md in /Glove and /fastText for guidance.
Modify the configurations in main.py to create a config that suits your idea.
Train your own model with

python main.py --train=True

Evaluate your model with

python main.py
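
For the embedding step above, it may help to see how pretrained vectors in plain-text form can be read back. The sketch below assumes the common text layout written by GloVe (one `token v1 v2 ... vd` line per entry) and also skips the `count dim` header line that fastText `.vec` files start with. The function name `load_text_embeddings` is hypothetical, and this is not the loader used by main.py.

```python
def load_text_embeddings(lines):
    """Parse 'token v1 v2 ... vd' lines into a {token: [floats]} dict.

    `lines` is any iterable of strings, e.g. an open text file.
    """
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # blank or malformed line
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            continue  # fastText-style header: "<vocab size> <dimension>"
        token, values = parts[0], parts[1:]
        table[token] = [float(v) for v in values]
    return table


# Usage with in-memory lines; with a real file one would write something like
#   with open("vectors.txt", encoding="utf-8") as f:   # hypothetical path
#       table = load_text_embeddings(f)
sample = ["2 3", "北 0.1 0.2 0.3", "京 0.4 0.5 0.6"]
table = load_text_embeddings(sample)
print(len(table), table["京"])
```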

Contributions

If you have any questions about this repo or find problems in my code, feel free to contact me via email or leave a comment in Issues.

Experiments

Baseline

Corpus: MSRA
Tag schema: iobes

我 S-PER
来 O
自 O
北 B-ORG
京 I-ORG
理 I-ORG
工 I-ORG
大 I-ORG
学 E-ORG
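
To make the IOBES schema concrete, the following sketch turns a tag sequence like the one above back into entity spans: `S-` marks a single-character entity, `B-`/`I-`/`E-` mark the beginning, inside, and end of a longer one, and `O` marks everything else. The helper name `iobes_to_spans` is an assumption for illustration, not a function from this repo.

```python
def iobes_to_spans(tags):
    """Return (start, end_inclusive, type) for each well-formed IOBES entity."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, label))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "I" and current == label:
            continue  # still inside the open entity
        elif prefix == "E" and current == label:
            spans.append((start, i, label))
            start, current = None, None
        else:
            # "O" or a malformed sequence: drop any open entity
            start, current = None, None
    return spans


chars = list("我来自北京理工大学")
tags = ["S-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG"]
for start, end, label in iobes_to_spans(tags):
    print(label, "".join(chars[start:end + 1]))
```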
Dataset split: 90% train, 10% test
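
A 90/10 split of this kind can be sketched as follows. Whether the original experiments shuffled the sentences or fixed a random seed is not stated here, so the shuffling, the `seed` parameter, and the function name `split_sentences` are all assumptions.

```python
import random


def split_sentences(sentences, train_ratio=0.9, seed=0):
    """Shuffle a copy of the sentences and cut it into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_sentences(range(100))
print(len(train), len(test))
```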
Structure
- with pretrained GloVe embedding for characters and randomly initialized embedding for segmentation features
- concatenate the above two
- Bi-LSTM
- CRF loss layer
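
One common way to build the segmentation input for a character-level model is to tag each character with its position inside the segmented word and then look that tag up in a small embedding table. The sketch below uses the encoding 0 = single-character word, 1 = word start, 2 = word middle, 3 = word end. Both this encoding and the function name `segmentation_features` are assumptions rather than a description of this repo's code. The words are segmented by hand here; in practice they could come from a segmenter such as `jieba.lcut(sentence)`.

```python
def segmentation_features(words):
    """Map each character to 0 (single), 1 (begin), 2 (inside), or 3 (end)."""
    features = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            features.append(0)
        else:
            features.extend([1] + [2] * (len(word) - 2) + [3])
    return features


# Hand-segmented example (a real segmenter may split the sentence differently)
words = ["我", "来自", "北京理工大学"]
print(segmentation_features(words))
```

The result has one feature id per character, so it can be concatenated position by position with the character embeddings.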
Result (test only)

Type	Accuracy	Precision	Recall	FB1
LOC	\	94.11%	90.58%	92.31
ORG	\	86.86%	88.43%	87.64
PER	\	92.21%	92.45%	92.33
OVER ALL	98.79%	91.89%	90.71%	91.30
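
The FB1 column is the harmonic mean of precision and recall, FB1 = 2PR / (P + R). As a quick sanity check, the short script below recomputes it from the rounded precision and recall values reported for the Baseline; the function name `f1` is just a label chosen for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Baseline result table
baseline = {
    "LOC": (94.11, 90.58),
    "ORG": (86.86, 88.43),
    "PER": (92.21, 92.45),
    "OVER ALL": (91.89, 90.71),
}
for name, (p, r) in baseline.items():
    print(f"{name}: {f1(p, r):.2f}")
```

Rounded to two decimals, the recomputed values agree with the FB1 column of the table.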

Version 1 (Bi-LSTM + CRF + CHAR_CNN)

Corpus: SAME as baseline
Tag schema: iobes

我 S-PER
来自 O
北京 B-ORG
理工 I-ORG
大学 E-ORG
Dataset split: SAME as baseline
Structure
- with pretrained GloVe embedding for words
- use a CNN layer to capture features from the character embeddings
- Bi-LSTM
- CRF loss layer
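
The character-level CNN in this structure can be pictured as sliding a set of filters over the character embeddings of one word and keeping the maximum response of each filter, which yields a fixed-size vector regardless of word length. The dependency-free sketch below illustrates that idea only. The window width, zero padding, ReLU activation, and the name `char_cnn` are assumptions, and the actual model is built with TensorFlow rather than plain lists.

```python
def char_cnn(char_vectors, filters, width=3):
    """Convolve over character vectors, then max-pool over positions.

    char_vectors: [num_chars][char_dim] embeddings of one word's characters
    filters:      [num_filters][width * char_dim] flattened filter weights
    Returns one pooled feature per filter.
    """
    char_dim = len(char_vectors[0])
    pad = [[0.0] * char_dim] * (width // 2)  # zero padding on both sides
    padded = pad + list(char_vectors) + pad
    pooled = []
    for weights in filters:
        activations = []
        for start in range(len(padded) - width + 1):
            # Flatten the window of `width` character vectors
            window = [x for vec in padded[start:start + width] for x in vec]
            response = sum(w * x for w, x in zip(weights, window))
            activations.append(max(0.0, response))  # ReLU
        pooled.append(max(activations))  # max-over-time pooling
    return pooled


# Two characters with 2-dimensional embeddings, three toy filters
word = [[1.0, 0.0], [0.0, 1.0]]
toy_filters = [
    [0, 0, 1, 0, 0, 0],        # first dimension of the centre character
    [0, 0, 0, 0, 0, 1],        # second dimension of the right neighbour
    [-1, -1, -1, -1, -1, -1],  # always negative, so ReLU clips it to 0
]
print(char_cnn(word, toy_filters))
```

The pooled vector would then be concatenated with the word embedding before the Bi-LSTM.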
Result (test only)

Type	Accuracy	Precision	Recall	FB1
LOC	\	93.31%	91.14%	92.21
ORG	\	87.15%	86.63%	86.89
PER	\	93.94%	91.94%	92.93
OVER ALL	98.83%	92.17%	90.42%	91.29
