[Enhancement] add whole word mask for chinese

Open paperplanet opened this issue 5 years ago • 16 comments

Description

Add whole word masking for Chinese (for BertTokenizer only).

Checklist

Essentials

  • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [ ] All changes have test coverage
  • [x] Code is well-documented

Changes

Comments

paperplanet avatar Jun 27 '19 03:06 paperplanet

Codecov Report

No coverage uploaded for pull request head (feature/chs-whole-word-mask@3e7468d). The diff coverage is n/a.

codecov[bot] avatar Jun 27 '19 03:06 codecov[bot]

Codecov Report

Merging #798 into master will increase coverage by 0.28%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #798      +/-   ##
==========================================
+ Coverage   88.23%   88.52%   +0.28%     
==========================================
  Files          73       73              
  Lines        6980     6980              
==========================================
+ Hits         6159     6179      +20     
+ Misses        821      801      -20     

Impacted Files                                   Coverage Δ
src/gluonnlp/data/word_embedding_evaluation.py   89.31% <0.00%> (-7.64%)
src/gluonnlp/data/glue.py                        96.81% <0.00%> (-1.82%)
src/gluonnlp/model/attention_cell.py             91.06% <0.00%> (+0.55%)
src/gluonnlp/model/bert.py                       94.62% <0.00%> (+2.98%)
src/gluonnlp/model/transformer.py                91.66% <0.00%> (+4.80%)
src/gluonnlp/model/utils.py                      80.00% <0.00%> (+6.92%)
src/gluonnlp/model/seq2seq_encoder_decoder.py    80.00% <0.00%> (+30.00%)

codecov[bot] avatar Jun 27 '19 03:06 codecov[bot]

Job PR-798/1 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/1/index.html

mli avatar Jun 27 '19 04:06 mli

Job PR-798/3 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/3/index.html

mli avatar Jun 28 '19 02:06 mli

Job PR-798/4 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/4/index.html

mli avatar Jul 01 '19 16:07 mli

Job PR-798/5 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/5/index.html

mli avatar Jul 01 '19 17:07 mli

Job PR-798/6 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/6/index.html

mli avatar Jul 02 '19 20:07 mli

Job PR-798/8 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/8/index.html

mli avatar Jul 07 '19 19:07 mli

Job PR-798/11 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/11/index.html

mli avatar Aug 15 '19 08:08 mli

Job PR-798/12 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/12/index.html

mli avatar Aug 19 '19 11:08 mli

@paperplanet could you resolve the conflicts? @eric-haibin-lin any further comments?

leezu avatar Jan 15 '20 15:01 leezu

Job PR-798/13 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/13/index.html

mli avatar Feb 20 '20 14:02 mli

Job PR-798/14 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/14/index.html

mli avatar Feb 20 '20 21:02 mli

Job PR-798/15 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/15/index.html

mli avatar Feb 20 '20 21:02 mli

Job PR-798/16 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-798/16/index.html

mli avatar Feb 21 '20 04:02 mli

Sorry for the late reply. I think I have resolved the conflicts. One procedural change needs to be reviewed: Chinese word segmentation now has to be done while the tokens are still strings, before they are converted to token ids. Also, cn_whole_word_mask is designed not to be turned on at the same time as whole_word_mask.

paperplanet avatar Feb 21 '20 07:02 paperplanet
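
For readers following the last comment, here is a minimal sketch of the two points it raises: word boundaries are recorded while the tokens are still strings, and the two masking flags are rejected when both are set. This is not the code from the PR. The function names (`group_chars_by_word`, `mask_whole_words`), the `mask_prob` and `rng` parameters, and the pre-segmented input are assumptions for illustration, and the sketch always substitutes `[MASK]` instead of applying BERT's usual mix of mask, random and unchanged tokens.

```python
import random


def group_chars_by_word(words):
    """Split pre-segmented Chinese words into character tokens.

    Returns the character tokens plus one [start, end) span per word, so the
    word boundaries are still known after character-level tokenization.
    """
    tokens, spans = [], []
    for word in words:
        start = len(tokens)
        tokens.extend(word)  # a str extends the list one character at a time
        spans.append((start, len(tokens)))
    return tokens, spans


def mask_whole_words(words, mask_prob=0.15, whole_word_mask=False,
                     cn_whole_word_mask=True, rng=None):
    """Mask string tokens; id conversion is expected to happen afterwards."""
    if whole_word_mask and cn_whole_word_mask:
        # Hypothetical check mirroring the constraint described in the thread.
        raise ValueError("whole_word_mask and cn_whole_word_mask "
                         "cannot both be enabled")
    rng = rng or random.Random()
    tokens, spans = group_chars_by_word(words)
    if not cn_whole_word_mask:
        # Fall back to masking each character independently.
        spans = [(i, i + 1) for i in range(len(tokens))]
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    order = list(spans)
    rng.shuffle(order)
    masked, positions = list(tokens), []
    for start, end in order:
        if len(positions) >= num_to_mask:
            break
        for i in range(start, end):  # mask every character of the chosen word
            masked[i] = "[MASK]"
            positions.append(i)
    return masked, sorted(positions)


if __name__ == "__main__":
    # The segmentation would normally come from a Chinese word segmenter.
    words = ["我", "喜欢", "自然", "语言", "处理"]
    print(mask_whole_words(words, mask_prob=0.3, rng=random.Random(0)))
```

Because a chosen word is always masked in full, the number of masked positions can slightly exceed the target computed from `mask_prob`. Converting the masked strings to vocabulary ids would be a separate, later step.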