xianyu

19 comments by xianyu

cppjieba handles English text, digits, and the like through rules, but at the moment these rules are coupled with the HMM. So it is enough to decouple this rule-based part from the HMM, though you do need to take care of some boundary cases.

Just disable the HMM together with the rule-based part (in jieba the two are currently coupled); with the HMM turned off, segmentation becomes pure dictionary-based maximum-probability segmentation.
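A minimal sketch of what this looks like with the Python jieba package, whose `cut`/`lcut` functions expose an `HMM` switch (the sample sentence is only illustrative):

```python
import jieba

sentence = "他来到了网易杭研大厦"

# Dictionary-only maximum-probability segmentation: no HMM-based recall of OOV words.
print(jieba.lcut(sentence, HMM=False))

# Default mixed mode for comparison: dictionary + HMM.
print(jieba.lcut(sentence, HMM=True))
```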

@Narsil thanks! Here is the code:

```python
# Initialize an empty tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)
filepath = "./data/corpus.txt"
# training
tokenizer.train(
    files=[filepath],
    vocab_size=5000,
    min_frequency=2,
    limit_alphabet=10000,
    ...
```

@Narsil Hi, here is the corpus: https://storage.googleapis.com/cluebenchmark/tasks/simclue_public.zip. I totally understand your concern. A smaller limit_alphabet works fine for me, as mentioned above. But how small is suitable, e.g. limit_alphabet
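One rough way to pick a value, sketched here as an assumption rather than part of the original thread: since limit_alphabet only caps how many distinct characters the WordPiece trainer keeps in its initial alphabet, counting the distinct characters in the corpus gives an upper bound that is actually useful. The file path reuses the one from the snippet above.

```python
from collections import Counter

# Count distinct (non-whitespace) characters in the training corpus.
# limit_alphabet only needs to cover the characters you want to keep;
# values above this count have no further effect.
with open("./data/corpus.txt", encoding="utf-8") as f:
    char_counts = Counter(ch for line in f for ch in line if not ch.isspace())

print("distinct characters:", len(char_counts))
print("most common:", char_counts.most_common(10))
```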

@Narsil Yes, maybe another subword tokenization algorithm is more suitable for precise control of the vocab size?

@thomasw21 Hi, the issue is that we want to deploy a tiny BERT model on a mobile phone. So we want to compress the vocab to get a smaller size and no...

@thomasw21 Thanks! I will try the byte-level tokenizer in our sentence embedding task later. Hopefully it reduces the size significantly without compromising performance, and doesn't have too severe a...
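For reference, a minimal sketch of training a byte-level BPE tokenizer with the HuggingFace tokenizers library; the vocab size, min frequency, and file path are just the values from the discussion above, not recommendations:

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE: the base alphabet is fixed at 256 byte values, so the final
# vocabulary size is controlled directly by vocab_size (no limit_alphabet needed).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
)

# Writes vocab.json and merges.txt to the given directory.
tokenizer.save_model("./tokenizer")
```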

Build two word-embedding layers, one for center words and one for context words:

```python
embed_size = 100
net = nn.Sequential(
    # embedding layer for center words
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
    # embedding layer for context words
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
)
```
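For completeness, a sketch of how these two embedding layers are typically combined in a skip-gram forward pass; the tensor shapes in the comments are assumptions about the batch layout, not taken from the original comment:

```python
import torch

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    # center: (batch_size, 1); contexts_and_negatives: (batch_size, max_len)
    v = embed_v(center)                      # (batch_size, 1, embed_size)
    u = embed_u(contexts_and_negatives)      # (batch_size, max_len, embed_size)
    # Inner product of each center-word vector with its context/negative vectors.
    pred = torch.bmm(v, u.permute(0, 2, 1))  # (batch_size, 1, max_len)
    return pred

# Usage: net[0] is the center-word embedding, net[1] the context-word embedding.
# pred = skip_gram(center, contexts_negatives, net[0], net[1])
```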

> You can check: [彙整中文與英文的詞性標註代號:結巴斷詞器與FastTag / Identify the Part of Speech in Chinese and English](http://blog.pulipuli.info/2017/11/fasttag-identify-part-of-speech-in.html).

Thanks for the explanation! jieba's POS tag set does indeed differ slightly from ICT's. Here is ICTPOS:

Word categories
========
* Content words: nouns, verbs, adjectives, state words, distinguishing words, numerals, measure words, pronouns
* Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections.

ICTPOS 3.0 POS tag set
===================
n   noun
nr  person name...

jieba's POS tagging is based on the dictionary, rules, and an HMM. The HMM may be able to distinguish a word's part of speech across different contexts, whereas with the dictionary alone there is currently no good way to do that. Note: the Python version's POS tagging is based on the dictionary and the HMM, while cppjieba is purely dictionary- and rule-based.
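As a small illustration of the Python behaviour described above, jieba's posseg module exposes the same HMM switch (the example sentence is only illustrative):

```python
import jieba.posseg as pseg

sentence = "他来到了网易杭研大厦"

# Dictionary + HMM (default): the HMM can also tag words not in the dictionary.
print(list(pseg.cut(sentence, HMM=True)))

# Dictionary only: words missing from the dictionary are not merged by the HMM.
print(list(pseg.cut(sentence, HMM=False)))
```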