
The content in pred.txt is repetitive

wangDong524 opened this issue 4 years ago

I used the Transformer model to train on my Chinese dataset. After running translate.py, the content in pred.txt is repetitive: the same prediction appears for many source sentences, and the output does not correspond to the source.

SENT 713: ['特', '斯', '拉', '发', '布', '6.0', '版', '本', '固', '件', '后', ',', '允', '许', '车', '主', '通', '过', '手', '机', '端', 'app', '软', '件', '驾', '车', ',', '不', '用', '随', '身', '携', '带', '钥', '匙', '。', '但', '激', '活', '过', '程', '中', '会', '受', '到', '识', '别', '过', '程', '复', '杂', ',', '手', '机', '信', '号', '不', '稳', '定', '的', '困', '扰', ',', '实', '际', '使', '用', '并', '不', '能', '完', '全', '放', '弃', '传', '统', '钥', '匙', '。(', '分', '享', '自', '@', '电', '动', '邦', ')'] PRED 713: 抢 高 铁 票 改 下 午 了 ! 铁 路 部 门 增 6 个 放 票 时 间 点 PRED SCORE: -2.0617

SENT 714: ['在', '长', '安', '逸', '动', 'ev', '的', '发', '布', '会', '上', ',', '逸', '动', '公', '布', '了', '补', '贴', '后', '14.49', '至', '15.99', '万', '元', '的', '补', '贴', '价', '。', '作', '为', '国', '内', '首', '款', '紧', '凑', '型', '三', '厢', '纯', '电', '动', '车', ',', '凭', '借', '着', '2660mm', '的', '轴', '距', '空', '间', '表', '现', ',', '它', '将', '拥', '有', '很', '强', '的', '市', '场', '竞', '争', '力', '。(', '分', '享', '自', '@', '电', '动', '邦', ')'] PRED 714: 抢 高 铁 票 改 下 午 了 ! 铁 路 部 门 增 6 个 放 票 时 间 点 PRED SCORE: -1.9775

wangDong524, Mar 07 '20

Your model is probably overfitting, and only learned to output this sentence.
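A quick way to check this is to count how many distinct hypotheses actually appear in the output file. The following is a minimal sketch; it assumes pred.txt sits in the current directory, one prediction per line:

```python
from collections import Counter

# Load the translation output produced by translate.py.
with open("pred.txt", encoding="utf-8") as f:
    preds = [line.strip() for line in f if line.strip()]

# If the model has collapsed to one or two sentences, the most common
# prediction will account for nearly every line of pred.txt.
counts = Counter(preds)
print(f"{len(preds)} predictions, {len(counts)} unique")
for sent, n in counts.most_common(3):
    print(f"{n:>6}  {sent}")
```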

francoishernandez, Mar 07 '20

Your dataset might be too tiny. How many lines?

Tokenizing the Chinese text into single characters, as in your SENT lines, will not give good results. Try using HanLP and/or SentencePiece instead (see the sketch below).
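For reference, here is a minimal SentencePiece sketch, not an official recipe; the file names train.zh and zh_bpe are placeholders, and the sample sentence is taken from SENT 713 above. Apply the trained model to your source and target text before training with OpenNMT-py, and decode the pieces back to plain text after translation:

```python
import sentencepiece as spm

# Train a subword (BPE) model on the raw, untokenized Chinese text.
spm.SentencePieceTrainer.train(
    input="train.zh",           # placeholder path to raw training text
    model_prefix="zh_bpe",
    vocab_size=32000,
    character_coverage=0.9995,  # recommended for large character sets such as Chinese
    model_type="bpe",
)

# Segment text into subword pieces before feeding it to OpenNMT-py,
# and de-tokenize the predictions afterwards.
sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
pieces = sp.encode("特斯拉发布6.0版本固件后,允许车主通过手机端app软件驾车。", out_type=str)
print(" ".join(pieces))   # subword-segmented line for training/translation
print(sp.decode(pieces))  # reconstructed original sentence
```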

JOHW85, Apr 04 '20