Results 123 comments of hankcs

Thanks for the contribution; I share a similar concern.
1. Resizing by 2x each time reduces fragmentation but raises peak memory usage.
2. The number of resizes depends on the size of the specific dictionary. Could large dictionaries be given a larger initial capacity up front? The exact value would need experiments to determine.
3. Doubling the size inside `resize` itself violates the "be explicit" principle.
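A minimal sketch of points 2 and 3 above: the class, names, and default capacity below are hypothetical, not HanLP's actual implementation. The idea is that `resize` grows to exactly the requested size, the doubling policy lives visibly at the call site, and large dictionaries can pass a bigger initial capacity.

```python
class DoubleArray:
    """Hypothetical buffer with explicit, caller-controlled growth."""

    DEFAULT_CAPACITY = 65536  # a large dictionary could pass a bigger hint

    def __init__(self, initial_capacity=DEFAULT_CAPACITY):
        self.base = [0] * initial_capacity

    def resize(self, new_size):
        # Grows to exactly new_size; no hidden doubling inside this method.
        if new_size > len(self.base):
            self.base.extend([0] * (new_size - len(self.base)))

    def ensure(self, index):
        # The growth policy (doubling) is explicit here, at the call site.
        if index >= len(self.base):
            self.resize(max(index + 1, len(self.base) * 2))
```

Keeping the policy out of `resize` makes it easy to benchmark alternatives (1.5x growth, dictionary-size-based presizing) without touching the resize mechanics.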

Hi,
1. tok does not discard whitespace in the text (full-width or half-width). A space *between* words is not part of any word, so naturally it does not appear inside a word. If tok decides a word itself **contains** a space, that space is kept as part of the word. For example, `'2021年蝴 蝶图标HanLPv2.1为生产环境带来次世代最先进的多语种Neuro-linguistic programming技术。'` is segmented into `['2021年', '蝴 蝶', '图标', 'HanLPv2.1', '为', '生产', '环境', '带来', '次世代', '最', '先进', '的', '多', '语种', 'Neuro-linguistic', 'programming', '技术', '。']`.
2. You seem to believe `'Neuro-linguistic programming'` should be one token; that is a difference of opinion about segmentation granularity. Under the MSR segmentation standard, English phrases are split apart, and HanLP's model follows that standard accurately.
3. For your purpose, HanLP can output each word's original offsets in the text, so "restoring the text" is entirely feasible in a few lines of code: https://colab.research.google.com/drive/1Q-CV_G-zSErzoT7PlVWzgYj-MNK1BBpf?usp=sharing
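A sketch of the restoration idea in point 3: given each token and its `(begin, end)` character offsets in the original text, the text can be rebuilt position by position. The function below is illustrative only (not HanLP's API); it assumes any character not covered by a token was a space.

```python
def restore_text(text_length, tokens, offsets, fill=' '):
    """Rebuild the original text from tokens and their character offsets.

    Positions not covered by any token are filled with `fill`
    (assumed here to be the whitespace the tokenizer skipped over).
    """
    chars = [fill] * text_length
    for token, (begin, end) in zip(tokens, offsets):
        chars[begin:end] = token  # place the token at its original span
    return ''.join(chars)
```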

Hi, the current STS model takes a pair of sentences as input to compute their similarity; it does not support outputting embeddings. We are developing sentence embeddings for retrieval, so stay tuned for future updates.

Thank you @flipz357 for reporting this. The randomness of Smatch implementations has been documented on [our forum](https://bbs.hankcs.com/t/topic/2380) for 4 years, and you have finally brought the community a solid solution. Your...

Good discussion. But I don't quite understand why this truncation/padding info has to be global. It can be passed as a parameter so that each tokenize call is thread-safe.
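A minimal sketch of the suggestion: pass truncation/padding per call instead of mutating tokenizer-wide state, so concurrent calls cannot race on shared settings. `SimpleTokenizer` and its parameters are hypothetical, for illustration only.

```python
class SimpleTokenizer:
    """Toy whitespace tokenizer; all per-call options are local arguments."""

    def tokenize(self, text, max_length=None, pad_to=None):
        tokens = text.split()
        if max_length is not None:
            tokens = tokens[:max_length]  # truncation is local to this call
        if pad_to is not None and len(tokens) < pad_to:
            tokens += ['[PAD]'] * (pad_to - len(tokens))
        return tokens
```

Because no instance attribute is written during `tokenize`, two threads calling it with different `max_length`/`pad_to` values cannot corrupt each other's settings.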

I've read many issues and articles; although the official maintainers suggest using tf ops for preprocessing, practitioners find it inconvenient or simply impossible. Let's take the famous...

Yes, as I mentioned in the [comment](https://github.com/hankcs/LDA4j/blob/master/src/test/java/com/hankcs/lda/TestCorpus.java), it is not so much a bug as a TODO feature. This method is under construction, but I did not have a good...

There is a check inside update: `if truth == guess: return None`

Let's take '胖' as an example. ``` 胖 胖肉半肉 月 肉 肉半 ``` The third column is the simplified semantic radical, whereas the fourth column is the traditional form. The...

radical.vec is actually the vector of a character in the form of its radical list. radical.ngram.vec is the vector of each radical. For joint pre-training, you need to 1. Modify...
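To make the distinction concrete: a character vector as in radical.vec can be composed from the vectors of its radicals. The averaging below is an assumption for illustration; the actual composition used to build radical.vec may differ (e.g. concatenation or a learned combination).

```python
import numpy as np

def char_vector(radicals, radical_vectors):
    """Compose a character vector by averaging its radicals' vectors.

    radicals: list of radical strings for one character
    radical_vectors: dict mapping radical -> np.ndarray (as in radical.ngram.vec)
    """
    return np.mean([radical_vectors[r] for r in radicals], axis=0)
```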