xianyu

19 comments by xianyu

cppjieba handles English text, digits, and the like through rules, but at the moment these rules are coupled with the HMM. So it is enough to decouple this rule-based part from the HMM, though you do need to take care of some boundary cases.

Just disable the HMM together with the rule-based part (in jieba the two are currently coupled); with the HMM turned off, segmentation becomes pure dictionary-based maximum-probability segmentation.
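A minimal sketch of what this looks like with the Python jieba package, whose `cut`/`lcut` functions expose an `HMM` switch (the sample sentence is only illustrative):

```python
import jieba

sentence = "他来到了网易杭研大厦"

# Dictionary-only maximum-probability segmentation: no HMM-based recall of OOV words.
print(jieba.lcut(sentence, HMM=False))

# Default mixed mode for comparison: dictionary + HMM.
print(jieba.lcut(sentence, HMM=True))
```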

@Narsil thanks! Here is the code:

```python
# Initialize an empty tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)
filepath = "./data/corpus.txt"
# training
tokenizer.train(
    files=[filepath],
    vocab_size=5000,
    min_frequency=2,
    limit_alphabet=10000,
    ...
```

@Narsil Hi, here is the corpus: https://storage.googleapis.com/cluebenchmark/tasks/simclue_public.zip. I totally understand your concern. A smaller limit_alphabet works fine for me, as mentioned above. But how small is suitable, e.g. limit_alphabet
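One rough way to pick a value, sketched here as an assumption rather than part of the original thread: since limit_alphabet only caps how many distinct characters the WordPiece trainer keeps in its initial alphabet, counting the distinct characters in the corpus gives an upper bound that is actually useful. The file path reuses the one from the snippet above.

```python
from collections import Counter

# Count distinct (non-whitespace) characters in the training corpus.
# limit_alphabet only needs to cover the characters you want to keep;
# values above this count have no further effect.
with open("./data/corpus.txt", encoding="utf-8") as f:
    char_counts = Counter(ch for line in f for ch in line if not ch.isspace())

print("distinct characters:", len(char_counts))
print("most common:", char_counts.most_common(10))
```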

@Narsil Yes, maybe another subword tokenization algorithm is more suitable for precise control of the vocab size?

@thomasw21 Hi, the issue is that we want to deploy a tiny BERT model on a mobile phone. So we want to compress the vocab to get a smaller size and no...

@thomasw21 Thanks! I will try the byte-level tokenizer in our sentence embedding task later. Hopefully it reduces the size significantly without compromising performance, and doesn't have too severe a...
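For reference, a minimal sketch of training a byte-level BPE tokenizer with the HuggingFace tokenizers library; the vocab size, min frequency, and file path are just the values from the discussion above, not recommendations:

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE: the base alphabet is fixed at 256 byte values, so the final
# vocabulary size is controlled directly by vocab_size (no limit_alphabet needed).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
)

# Writes vocab.json and merges.txt to the given directory.
tokenizer.save_model("./tokenizer")
```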

Build two word-embedding layers, one for center words and one for context words:

```python
embed_size = 100
net = nn.Sequential(
    # embedding layer for center words
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
    # embedding layer for context words
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
)
```
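For completeness, a sketch of how these two embedding layers are typically combined in a skip-gram forward pass; the tensor shapes in the comments are assumptions about the batch layout, not taken from the original comment:

```python
import torch

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    # center: (batch_size, 1); contexts_and_negatives: (batch_size, max_len)
    v = embed_v(center)                      # (batch_size, 1, embed_size)
    u = embed_u(contexts_and_negatives)      # (batch_size, max_len, embed_size)
    # Inner product of each center-word vector with its context/negative vectors.
    pred = torch.bmm(v, u.permute(0, 2, 1))  # (batch_size, 1, max_len)
    return pred

# Usage: net[0] is the center-word embedding, net[1] the context-word embedding.
# pred = skip_gram(center, contexts_negatives, net[0], net[1])
```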

> You can check: [彙整中文與英文的詞性標註代號:結巴斷詞器與FastTag / Identify the Part of Speech in Chinese and English](http://blog.pulipuli.info/2017/11/fasttag-identify-part-of-speech-in.html).

Thanks for the explanation! jieba's POS tag set does indeed differ slightly from ICT's. Here is ICTPOS:

Word categories
========
* Content words: nouns, verbs, adjectives, state words, distinguishing words, numerals, measure words, pronouns
* Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections.

ICTPOS 3.0 POS tag set
===================
n   noun
nr  person name...

jieba's POS tagging is based on the dictionary, rules, and an HMM. The HMM may be able to distinguish a word's part of speech across different contexts, whereas with the dictionary alone there is currently no good way to do that. Note: the Python version's POS tagging is based on the dictionary and the HMM, while cppjieba is purely dictionary- and rule-based.
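As a small illustration of the Python behaviour described above, jieba's posseg module exposes the same HMM switch (the example sentence is only illustrative):

```python
import jieba.posseg as pseg

sentence = "他来到了网易杭研大厦"

# Dictionary + HMM (default): the HMM can also tag words not in the dictionary.
print(list(pseg.cut(sentence, HMM=True)))

# Dictionary only: words missing from the dictionary are not merged by the HMM.
print(list(pseg.cut(sentence, HMM=False)))
```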