
Add an LSTM-CRF model for the CoNLL-2003 dataset

Open hazelnutsgz opened this issue 5 years ago • 9 comments

Description

Add the LSTM-CRF model for the CoNLL-2003 dataset under the reproduction dir, based on the fastNLP lib and inspired by the paper https://arxiv.org/pdf/1508.01991.pdf
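
For readers who just want to see the shape of such a model, here is a minimal, self-contained BiLSTM-CRF sketch in plain PyTorch. This is not the PR's fastNLP-based implementation; the class name, dimensions, and the omission of padding/masking and start/stop transitions are simplifications for illustration.

```python
import torch
import torch.nn as nn


class BiLSTMCRF(nn.Module):
    """Toy BiLSTM-CRF tagger (no padding/masking, illustrative dimensions)."""

    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.emission = nn.Linear(hidden_dim, num_tags)
        # transitions[i, j] = score of moving from tag i to tag j
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags) * 0.01)

    def _emissions(self, tokens):
        # tokens: (batch, seq_len) -> per-token tag scores (batch, seq_len, num_tags)
        out, _ = self.lstm(self.embed(tokens))
        return self.emission(out)

    def _gold_score(self, emissions, tags):
        # Unnormalised score of the gold tag sequence.
        score = emissions.gather(2, tags.unsqueeze(-1)).squeeze(-1).sum(1)
        score = score + self.transitions[tags[:, :-1], tags[:, 1:]].sum(1)
        return score

    def _log_partition(self, emissions):
        # Forward algorithm: log-sum-exp of scores over all tag sequences.
        seq_len = emissions.size(1)
        alpha = emissions[:, 0]                           # (batch, num_tags)
        for t in range(1, seq_len):
            scores = (alpha.unsqueeze(2)                  # previous tag
                      + self.transitions.unsqueeze(0)     # previous -> next
                      + emissions[:, t].unsqueeze(1))     # next tag
            alpha = torch.logsumexp(scores, dim=1)
        return torch.logsumexp(alpha, dim=1)

    def neg_log_likelihood(self, tokens, tags):
        emissions = self._emissions(tokens)
        return (self._log_partition(emissions)
                - self._gold_score(emissions, tags)).mean()

    def decode(self, tokens):
        # Viterbi decoding: most likely tag sequence for each sentence.
        emissions = self._emissions(tokens)
        seq_len = emissions.size(1)
        score = emissions[:, 0]
        history = []
        for t in range(1, seq_len):
            total = (score.unsqueeze(2) + self.transitions.unsqueeze(0)
                     + emissions[:, t].unsqueeze(1))
            score, best_prev = total.max(dim=1)
            history.append(best_prev)
        best_tag = score.argmax(dim=1)
        path = [best_tag]
        for best_prev in reversed(history):
            best_tag = best_prev.gather(1, best_tag.unsqueeze(1)).squeeze(1)
            path.append(best_tag)
        return torch.stack(list(reversed(path)), dim=1)   # (batch, seq_len)
```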

Main reason

Provide a new demo of how fastNLP can facilitate the development of deep learning models. FYI: https://github.com/hazelnutsgz/fastNLP/tree/hazelnutsgz-crf-lstm/reproduction/LSTM-CRF

Checklist (check whether each item below is complete)

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [$CATEGORY] (e.g. [bugfix] fixes a bug, [new] adds a new feature, [test] modifies tests, [rm] removes old code)
  • [x] Changes are complete (i.e. I finished coding on this PR); submit the PR only after the changes are complete
  • [x] All changes have test coverage and the modified parts pass the tests; for changes to fastnlp/fastnlp/, test code must be provided in fastnlp/test/
  • [x] Code is well-documented; write comments carefully, since the API documentation is extracted from them
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change; if the change affects examples or tutorials, please contact a core developer

Changes

  • An interactive Jupyter notebook
  • A well-structured codebase for training & testing
  • A README file with usage instructions

Mention:

@yhcc @xpqiu @FengZiYjun @2017alan

hazelnutsgz avatar Jan 10 '19 08:01 hazelnutsgz

Codecov Report

Merging #122 into master will decrease coverage by 0.11%. The diff coverage is n/a.


@@            Coverage Diff            @@
##           master    #122      +/-   ##
=========================================
- Coverage   70.31%   70.2%   -0.12%     
=========================================
  Files          82      82              
  Lines        5407    5407              
=========================================
- Hits         3802    3796       -6     
- Misses       1605    1611       +6
Impacted Files | Coverage Δ
fastNLP/models/biaffine_parser.py | 94.44% <0%> (-2.23%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ef82c1f...e67e354.

codecov-io avatar Jan 10 '19 08:01 codecov-io

Great! The logs and binary files don't need to be committed.

xpqiu avatar Jan 10 '19 09:01 xpqiu

OK, I have just updated my commit; thanks for your careful review.

hazelnutsgz avatar Jan 10 '19 11:01 hazelnutsgz

I think the data as well as the training code may not be necessary in the reproduction; the reproduction should contain a trained model that can be used directly.

xuyige avatar Jan 11 '19 06:01 xuyige

I think the data as well as the training code may not be necessary in the reproduction; the reproduction should contain a trained model that can be used directly.

Thanks for your comments @xuyige. The following are my replies and proposals:

Reply

  1. Could you please elaborate on what you mean by "the training code"? I am a little confused.
  2. I think the dataset, say, CoNLL-2003, is necessary for fastNLP users to reproduce the work, for the following reasons:
    1. Acquiring the CoNLL-2003 dataset with fastNLP is not as easy as acquiring the MNIST dataset through the network APIs provided by the frameworks (TF & Torch), so a preloaded dataset is necessary for NLP novices to ramp up with the project (a minimal reader sketch for the CoNLL-2003 format follows below).
    2. Some other projects under the reproduction directory, say, Char-aware_NLM, also include train.txt and test.txt.
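
As a concrete illustration of why a preloaded copy helps, the sketch below shows one way to read a CoNLL-2003 style file (one token per line with its POS, chunk, and NER columns, blank lines separating sentences). The function name and the assumption that the NER tag is the last column are illustrative; this is not code from the PR.

```python
def read_conll2003(path):
    """Return a list of (tokens, ner_tags) pairs from a CoNLL-2003 style file."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                if tokens:                      # sentence boundary
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            cols = line.split()
            tokens.append(cols[0])              # word form
            tags.append(cols[-1])               # NER tag (last column)
    if tokens:                                  # trailing sentence without blank line
        sentences.append((tokens, tags))
    return sentences
```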

Proposal

Based on how TF & PyTorch load the MNIST dataset (over the network), I think fastNLP could consider providing data-downloading APIs for some widely acknowledged NLP datasets, e.g., SQuAD.
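
To make the proposal concrete, here is a rough standard-library sketch of the kind of download-and-cache helper being suggested; the function name, URL handling, and cache directory are placeholders, not an existing fastNLP API.

```python
import os
import urllib.request


def fetch_dataset(url, cache_dir="~/.fastNLP/datasets"):
    """Download a dataset file once and reuse the cached copy afterwards."""
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(target):  # skip the download if already cached
        urllib.request.urlretrieve(url, target)
    return target
```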

hazelnutsgz avatar Jan 11 '19 07:01 hazelnutsgz

I regret to point out that Char-aware_NLM was borrowed from another project; it is an outdated version, and the code hasn't been updated for months. Data downloading has been considered, but a server is needed to host the downloads, so we put it on the todo list. The preloaded dataset is a good suggestion; we will discuss it soon.

xuyige avatar Jan 12 '19 17:01 xuyige

It should be working; the code seems to be fine, but things still don't add up. I am currently working on this one.

Thanks for your review~

hazelnutsgz avatar Jan 18 '19 02:01 hazelnutsgz

Is the util file missing here? When I run it, I am prompted that there is no load_data function.

Hou-jing avatar Jun 26 '22 12:06 Hou-jing

Yes, we don't have a load_data function. You may be using an old version.

yhcc avatar Jun 27 '22 04:06 yhcc