gluon-nlp
[Enhancement] add whole word mask for chinese
Description
Add whole word masking for Chinese (for BertTokenizer only).
Checklist
Essentials
- [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
- [x] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [x] Code is well-documented
Changes
Comments
Codecov Report
:exclamation: No coverage uploaded for pull request head (feature/chs-whole-word-mask@3e7468d). Click here to learn what that means. The diff coverage is n/a.
Codecov Report
Merging #798 into master will increase coverage by 0.28%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #798 +/- ##
==========================================
+ Coverage 88.23% 88.52% +0.28%
==========================================
Files 73 73
Lines 6980 6980
==========================================
+ Hits 6159 6179 +20
+ Misses 821 801 -20
Impacted Files | Coverage Δ | |
---|---|---|
src/gluonnlp/data/word_embedding_evaluation.py | 89.31% <0.00%> (-7.64%) | :arrow_down: |
src/gluonnlp/data/glue.py | 96.81% <0.00%> (-1.82%) | :arrow_down: |
src/gluonnlp/model/attention_cell.py | 91.06% <0.00%> (+0.55%) | :arrow_up: |
src/gluonnlp/model/bert.py | 94.62% <0.00%> (+2.98%) | :arrow_up: |
src/gluonnlp/model/transformer.py | 91.66% <0.00%> (+4.80%) | :arrow_up: |
src/gluonnlp/model/utils.py | 80.00% <0.00%> (+6.92%) | :arrow_up: |
src/gluonnlp/model/seq2seq_encoder_decoder.py | 80.00% <0.00%> (+30.00%) | :arrow_up: |
Job PR-798/1 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/1/index.html
Job PR-798/3 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/3/index.html
Job PR-798/4 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/4/index.html
Job PR-798/5 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/5/index.html
Job PR-798/6 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/6/index.html
Job PR-798/8 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/8/index.html
Job PR-798/11 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/11/index.html
Job PR-798/12 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/12/index.html
@paperplanet could you resolve the conflicts? @eric-haibin-lin any further comments?
Job PR-798/13 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/13/index.html
Job PR-798/14 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/14/index.html
Job PR-798/15 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/15/index.html
Job PR-798/16 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/16/index.html
Sorry for the late reply. I think I have resolved the conflicts. One procedural change needs to be reviewed: Chinese tokenization now has to be done before string tokens are converted to token ids. Also, `cn_whole_word_mask` is designed not to be turned on at the same time as `whole_word_mask`.
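For reviewers, here is a minimal sketch of the two points above. It is not the code in this PR. It assumes the sentence has already been split into words by some external Chinese word segmenter, and the helper names `check_mask_flags`, `group_chars_by_word`, and `mask_chinese_whole_words`, as well as the toy vocabulary, are hypothetical. It only illustrates why word boundaries must be known while tokens are still strings, and how the two flags can be kept mutually exclusive.

```python
import random


def check_mask_flags(whole_word_mask, cn_whole_word_mask):
    """Reject the combination of both masking modes (hypothetical helper)."""
    if whole_word_mask and cn_whole_word_mask:
        raise ValueError(
            "cn_whole_word_mask and whole_word_mask cannot both be enabled")


def group_chars_by_word(words):
    """Split pre-segmented Chinese words into single-character tokens.

    Returns the character tokens and one [start, end) span per word.
    Assumes every word consists of Chinese characters only.
    """
    tokens, spans = [], []
    for word in words:
        start = len(tokens)
        tokens.extend(word)  # iterating a str yields its characters
        spans.append((start, len(tokens)))
    return tokens, spans


def mask_chinese_whole_words(words, vocab, mask_prob=0.15,
                             cn_whole_word_mask=True, rng=None):
    """Mask string tokens first, then convert them to ids."""
    rng = rng or random.Random()
    tokens, spans = group_chars_by_word(words)
    # With whole word masking the candidates are word spans;
    # otherwise every character is its own candidate.
    if cn_whole_word_mask:
        candidates = list(spans)
    else:
        candidates = [(i, i + 1) for i in range(len(tokens))]
    rng.shuffle(candidates)

    budget = max(1, round(len(tokens) * mask_prob))
    masked, covered = list(tokens), 0
    for start, end in candidates:
        if covered >= budget:
            break
        for i in range(start, end):
            masked[i] = "[MASK]"
        covered += end - start

    # Id conversion happens last: word boundaries are lost once
    # tokens are replaced by integers.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in masked]
    return masked, ids


# Toy usage with a made-up vocabulary.
demo_words = ["使用", "语言", "模型"]
demo_vocab = {"[UNK]": 0, "[MASK]": 1}
for ch in "".join(demo_words):
    demo_vocab.setdefault(ch, len(demo_vocab))

check_mask_flags(whole_word_mask=False, cn_whole_word_mask=True)
masked_tokens, token_ids = mask_chinese_whole_words(
    demo_words, demo_vocab, rng=random.Random(0))
print(masked_tokens, token_ids)
```

In this sketch all characters of a selected word are replaced together, so a word is never half masked. Because a whole word is masked at once, the number of masked characters can exceed the budget slightly.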