The trained model does not work very well...

Open ScottishFold007 opened this issue 2 years ago • 14 comments

Hello! Your open-source project is great and very useful. I trained on a Chinese dataset for a few epochs, but the results were not very good. Can you tell me what might be causing this? Train data: (image)

Example prediction:


import json
from fsner import FSNERModel, FSNERTokenizerUtils, pretty_embed

query_texts = [
    "阿贵住在户部巷吗?",
    "我不喜欢看《人鱼传说》",
    "我喜欢李柏林的'天空之城',写的很好"
]


support_texts = {
    "地址": [
        "彭小军认为,国内银行现在走的是[E]台湾[/E]的发卡模式,先通过跑马圈地再在圈的地里面选择客户,",
        "郑阿姨就赶到[E]文汇路[/E]排队拿钱,希望能将缴纳的一万余元学费拿回来,顺便找校方或者教委要个说法。",
        "如今着整个[E]潮白河[/E]区域环境的巨大变化和环首都经济圈的快速推进,夏威夷水岸1号的稀缺价值越来越明显,",
        "如今着整个潮白河区域环境的巨大变化和环首都经济圈的快速推进,[E]夏威夷水岸1号[/E]的稀缺价值越来越明显,",
        "这也让很多业主据此认为,[E]雅清苑[/E]是政府公务员挤对了国家的经适房政策。",
    ],
    "书籍": [
        "除了冠军外有7个名额的入围奖,奖品是[E]《暗黑破坏神》全套小说[/E]、《魔兽争霸》全套小说",
        "除了冠军外有7个名额的入围奖,奖品是《暗黑破坏神》全套小说、[E]《魔兽争霸》全套小说[/E]",
        "本次促销活动赠送的周边产品全部都是限量版啊!值得一提的是[E]《红楼梦》[/E]精美人物主题书签A组、",
        "“去年银监会下发[E]《关于信用卡套现活跃风险提示的通知》[/E]要求:严格禁止将pos机发放在个人名下,",
    ],
}

device = 'cpu'

model_path = '/content/checkpoints/model'
tokenizer = FSNERTokenizerUtils(model_path)
queries = tokenizer.tokenize(query_texts).to(device)
supports = tokenizer.tokenize(list(support_texts.values())).to(device)

model = FSNERModel(model_path)
model.to(device)

p_starts, p_ends = model.predict(queries, supports)

# One can prepare supports once and reuse them multiple times with different queries
# ------------------------------------------------------------------------------
# start_token_embeddings, end_token_embeddings = model.prepare_supports(supports)
# p_starts, p_ends = model.predict(queries, start_token_embeddings=start_token_embeddings,
#                                  end_token_embeddings=end_token_embeddings)

output = tokenizer.extract_entity_from_scores(query_texts, queries, p_starts, p_ends,
                                              entity_keys=list(support_texts.keys()), thresh=0.010)

print(json.dumps(output, indent=2, ensure_ascii=False))

# pretty_embed renders with displaCy (part of spaCy), which must be installed
pretty_embed(query_texts, output, list(support_texts.keys()))

image

ScottishFold007 avatar Mar 30 '22 09:03 ScottishFold007

Thanks for trying it out. This will be very helpful to fix bugs and make the library more usable.

Which pretrained model did you use for training? For English, I used bert-base-uncased.

sayef avatar Mar 30 '22 09:03 sayef

Because I am using a Chinese dataset, I trained with Langboat/mengzi-bert-base, which is pretrained on a Chinese corpus.

ScottishFold007 avatar Mar 30 '22 09:03 ScottishFold007

I don't know whether this has anything to do with the language, but East Asian languages such as Chinese, Korean, and Japanese all require word segmentation.

ScottishFold007 avatar Mar 30 '22 09:03 ScottishFold007
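On the word-segmentation point above: BERT-style tokenizers usually sidestep it. BERT's basic tokenizer pads each CJK character with whitespace before WordPiece runs, so each Chinese character normally becomes its own token and no separate segmenter is involved. The sketch below imitates only that one step in plain Python. The helper names `is_cjk` and `split_cjk` are hypothetical, only the main CJK Unified Ideographs block is covered, and punctuation is not split off as a real tokenizer would do. Whether Langboat/mengzi-bert-base tokenizes this way is an assumption worth checking against its tokenizer config.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_cjk(text: str) -> List[str]:
    """Pad every CJK character with spaces, then split on whitespace."""
    padded = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    return padded.split()


# Each Chinese character ends up as a separate token; Latin words stay whole.
tokens = split_cjk("abc 户部巷 def")
print(tokens)
```

If the tokenizer in use behaves like this, entity boundaries marked with [E] and [/E] fall on character boundaries, so missing word segmentation is less likely to be the cause of the poor results than it might first appear.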

This is the style of the training corpus: {"address": ["彭小军认为,国内银行现在走的是[E]台湾[/E]的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", "郑阿姨就赶到[E]文汇路[/E]排队拿钱,希望能将缴纳的一万余元学费拿回来,顺便找校方或者教委要个说法。", "如今着整个[E]潮白河[/E]区域环境的巨大变化和环首都经济圈的快速推进,夏威夷水岸1号的稀缺价值越来越明显,", "如今着整个潮白河区域环境的巨大变化和环首都经济圈的快速推进,[E]夏威夷水岸1号[/E]的稀缺价值越来越明显,", "这也让很多业主据此认为,[E]雅清苑[/E]是政府公务员挤对了国家的经适房政策。", "陈艳萍:买[E]西山[/E]的人的购房需求,主要有两种,一种是养老型的需求,很多人认为在西山是能够颐养天年的"], "name": ["[E]彭小军[/E]认为,国内银行现在走的是台湾的发卡模式,先通过跑马圈地再在圈的地里面选择客户,", "[E]温格[/E]的球队终于又踢了一场经典的比赛,2比1战胜曼联之后枪手仍然留在了夺冠集团之内,", "突袭黑暗雅典娜》中[E]Riddick[/E]发现之前抓住他的赏金猎人Johns,", "突袭黑暗雅典娜》中Riddick发现之前抓住他的赏金猎人[E]Johns[/E],", "吴三桂演义》小说的想像,说是为[E]牛金星[/E]所毒杀。……在小说中加插一些历史背景,", "市场仍存在对网络销售形式的需求,网络购彩前景如何?为此此我们采访业内专家[E]程阳[/E]先生。", "本报讯(记者[E]王吉瑛[/E])双色球即将出台新规,一等奖最高奖金可达到1000万元。昨天,中彩中心透露,", "价格高昂的大钻和翡翠消费为何如此火?通灵珠宝总裁[E]沈东军[/E]认为,这与原料稀缺有直接关系。“", "是目前表现最好的锋线组合之一,而[E]沃尔科特[/E]往往能够让对手的整个左边肋疲于防守,以目前枪手的能力和状态,", "[E]Svensson[/E]在接受媒体采访时表示,CAPCOM并没有放弃《街霸》电影系列,将推出新的《", "证券时报记者[E]唐曜华[/E]", "现役DotA明星选手,担任SOLO位的世界第一影魔[E]Pis(卜严骏)[/E],", "[E]郭庆祥[/E]:我们看画廊如果有好的艺术家,好的作品进去,我们是真正想去买好的艺术作品,而不是投资,", "腾讯新闻昨天[E]金庸[/E]逝世江湖再无金大侠订阅号消息昨天【15条】王者荣耀福利抢鲜看!队友的京东京东jd.", "[E]陈艳萍[/E]:买西山的人的购房需求,主要有两种,一种是养老型的需求,很多人认为在西山是能够颐养天年的,"]

ScottishFold007 avatar Mar 30 '22 09:03 ScottishFold007
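A quick sanity check on a corpus in the style shown above can rule out markup problems before training. The sketch below assumes the data is a dict mapping each entity label to a list of strings, and that every string should contain exactly one well-formed [E]...[/E] span, as in the samples in this thread. That one-span rule is an assumption drawn from those samples, not a documented requirement of the library, and `check_supports` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Non-greedy match of one marked entity, e.g. "[E]文汇路[/E]"
ENTITY_SPAN = re.compile(r"\[E\](.+?)\[/E\]")


def check_supports(supports: Dict[str, List[str]]) -> List[str]:
    """Return a description of every example without exactly one [E]...[/E] span."""
    problems = []
    for label, examples in supports.items():
        for i, text in enumerate(examples):
            opens = text.count("[E]")
            closes = text.count("[/E]")
            spans = ENTITY_SPAN.findall(text)
            if opens != 1 or closes != 1 or len(spans) != 1:
                problems.append(
                    f"{label}[{i}]: {opens} '[E]', {closes} '[/E]', {len(spans)} span(s)"
                )
    return problems


sample = {
    "address": [
        "郑阿姨就赶到[E]文汇路[/E]排队拿钱",
        "这也让很多业主据此认为,雅清苑是政府公务员",  # no marked entity
    ],
    "name": ["证券时报记者[E]唐曜华[/E]"],
}
problems = check_supports(sample)
print(problems)
```

An empty result only means the markup is consistent. It says nothing about whether the labels themselves are correct.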

Dataset preparation and pre-trained model selection seem fine. What were the val_loss_epoch and val_acc_epoch values after the first few epochs?

sayef avatar Mar 30 '22 09:03 sayef

> Dataset preparation and pre-trained model selection seem fine. What were the val_loss_epoch and val_acc_epoch values after the first few epochs?

image

ScottishFold007 avatar Mar 30 '22 10:03 ScottishFold007

Great. Please continue training for at least 20 epochs; it should get better. It was not great for me either during the first few epochs.

sayef avatar Mar 30 '22 10:03 sayef

OK, I will continue training and report back later. Thank you for your careful reply!

ScottishFold007 avatar Mar 30 '22 10:03 ScottishFold007

@ScottishFold007 Hi again! Was your training successful?

sayef avatar Apr 05 '22 19:04 sayef

> @ScottishFold007 Hi again! Was your training successful?

I have tried many methods, but the results are still poor, and I don't know why. Could I send you the data and ask you to test it?

ScottishFold007 avatar Apr 07 '22 13:04 ScottishFold007

I can try. Please send a link to your dataset to [email protected]

sayef avatar Apr 12 '22 11:04 sayef

> I can try. Please send a link to your dataset to [email protected]

Hello, I've sent you the processed training and test sets via [email protected]. Thanks for your help!

ScottishFold007 avatar Apr 13 '22 01:04 ScottishFold007

Hi @ScottishFold007! Could you get better results?

polodealvarado avatar May 25 '22 15:05 polodealvarado

> Hi @ScottishFold007! Could you get better results?

still very bad~

ScottishFold007 avatar Aug 21 '22 13:08 ScottishFold007