SubwordEncoding-CWS
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Subword encoding for Word Segmentation using Lattice LSTM.
Models and results are described in our paper, Subword Encoding in Lattice LSTM for Chinese Word Segmentation.
Requirement:
Python: 2.7
PyTorch: 0.3.0
Input format:
CoNLL format (BMES tag scheme preferred), with one character and its label per line. Sentences are separated by an empty line.
中 B-SEG
国 E-SEG
最 B-SEG
大 E-SEG
氨 B-SEG
纶 M-SEG
丝 E-SEG
生 B-SEG
产 E-SEG
基 B-SEG
地 E-SEG
在 S-SEG
连 B-SEG
云 M-SEG
港 E-SEG
建 B-SEG
成 E-SEG
新 B-SEG
华 M-SEG
社 E-SEG
北 B-SEG
京 E-SEG
十 B-SEG
二 M-SEG
月 E-SEG
二 B-SEG
十 M-SEG
六 M-SEG
日 E-SEG
电 S-SEG
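To prepare data in this format, a small converter can help. The following is a minimal sketch, not part of this repository: it assumes the input is already segmented into words, and the helper names `words_to_bmes`, `to_conll`, and `read_conll` are hypothetical. It is written for Python 3; under Python 2.7 the strings would need to be `unicode` objects.

```python
# -*- coding: utf-8 -*-
# Illustrative helpers (hypothetical names, not taken from the repository).


def words_to_bmes(words, suffix="SEG"):
    """Turn a list of words into (character, label) pairs using BMES tags."""
    pairs = []
    for word in words:
        if len(word) == 1:
            # A single-character word gets the S tag.
            pairs.append((word, "S-" + suffix))
            continue
        for i, char in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            pairs.append((char, tag + "-" + suffix))
    return pairs


def to_conll(sentences):
    """Serialize sentences (lists of words): one 'char label' pair per line,
    with an empty line between sentences."""
    blocks = []
    for words in sentences:
        lines = ["{} {}".format(c, t) for c, t in words_to_bmes(words)]
        blocks.append("\n".join(lines) + "\n")
    return "\n".join(blocks)


def read_conll(text):
    """Parse text in the format above back into sentences of words."""
    sentences, words, current = [], [], ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current sentence.
            if current:
                words.append(current)
                current = ""
            if words:
                sentences.append(words)
                words = []
            continue
        char, label = line.split()
        tag = label.split("-")[0]
        if tag in ("B", "S") and current:
            # Tolerate a word that was never closed with an E tag.
            words.append(current)
            current = ""
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    if words:
        sentences.append(words)
    return sentences


if __name__ == "__main__":
    print(to_conll([["中国", "最大", "氨纶丝", "生产", "基地", "在"]]))
```

Reading a file back with `read_conll` is a quick way to check that a converted corpus round-trips to the original segmentation before training.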
Pretrained Embeddings:
The pretrained character and word embeddings are the same as those used in the baseline of RichWordSegmentor.
- Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec): Google Drive or Baidu Pan
- Character bigram embeddings (gigaword_chn.all.a2b.bi.ite50.vec): Google Drive or Baidu Pan
- Word embeddings (ctb.50d.vec): Google Drive or Baidu Pan
- Subword(BPE) embeddings: zh.wiki.bpe.op200000.d50.w2v.txt
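Before training, it can be worth checking that a downloaded embedding file parses as expected. The sketch below assumes the files use the common word2vec text layout (an optional `count dim` header line, then one token followed by its vector values per line); that layout is an assumption, not something this README states. The function names `parse_embedding_lines` and `load_text_embeddings` are hypothetical and independent of the repository's own loading code.

```python
# -*- coding: utf-8 -*-
import io


def parse_embedding_lines(lines, expected_dim=None):
    """Parse word2vec-style text lines into a {token: [float, ...]} dict.

    Assumed layout: an optional 'count dim' header, then 'token v1 v2 ...'.
    Rows whose length differs from expected_dim (when given) are skipped.
    """
    vectors = {}
    for line_no, line in enumerate(lines):
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # looks like a 'count dim' header
        token = parts[0]
        values = [float(x) for x in parts[1:]]
        if expected_dim is not None and len(values) != expected_dim:
            continue
        vectors[token] = values
    return vectors


def load_text_embeddings(path, expected_dim=None):
    """Read an embedding file from disk, decoding it as UTF-8."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_embedding_lines(handle, expected_dim)
```

For the 50-dimensional files listed above, something like `load_text_embeddings("ctb.50d.vec", expected_dim=50)` would return only the rows with exactly 50 values, which makes truncated downloads easy to spot.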
How to run the code?
- Download the character embeddings, character bigram embeddings, and BPE (or word) embeddings, and set their directories in main.py.
- Modify run_seg.py by adding your train/dev/test file directory.
- sh run_seg.py
Cite:
Cite our paper as:
@inproceedings{yang2019subword,
title={Subword Encoding in Lattice LSTM for Chinese Word Segmentation},
author={Yang, Jie and Zhang, Yue and Liang, Shuailong},
booktitle={NAACL},
year={2019}
}