word2vec

not chinese

Open niutyut opened this issue 3 years ago • 3 comments

[image]

file_in contains the text shown in this picture.

model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c("鹰"), type = "nearest", top_n = 5)
lookslike

Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: 鹰

But 鹰 does appear in the text shown in the picture. Can you provide an example in Chinese?

niutyut avatar Sep 09 '21 13:09 niutyut

I ran this on a Linux box on the attached file, example.txt. I'm not sure whether the output makes sense, as I don't speak Chinese.

> x <- readLines("example.txt", encoding = "UTF-8")
> cat(x)
形态 开花的苹果树 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不规则的纵裂或片状剥落,小枝光滑。叶序为单叶互生,椭圆至卵圆形,叶缘有锯齿。伞房花序,花瓣白色,含苞时带粉红色,雄蕊20,花柱5,大多数品种自花不育,需种植授粉树。果实为仁果,颜色及大小因品种而异。蘋果膳食纖維含量很豐富﹐也含有大量的果膠﹐對於整腸及調整腸道菌叢生態大有幫助。是一種綠色水果 [编辑] 习性 喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水良好的沙质土壤。 [编辑] 品种 世界苹果产量 蘋果有超过7,500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果)、Gold Delicious等[1]。英國北威爾斯巴德西島(Bardsey Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會患病,被媒體稱為「世界上最罕有的蘋果」。除鮮食的品種外,尚有烹調用的蘋果。由於蘋果的果酸有保持水份的作用,適宜烤焗。
> w2v <- word2vec::word2vec(x, min_count = 0)
> predict(w2v, newdata = "形态", type = "nearest")
$形态
  term1                                                                term2 similarity rank
1  形态                                                                    1  0.5622963    1
2  形态                                      。英國北威爾斯巴德西島(Bardsey  0.5039170    2
3  形态                                                                 习性  0.4285294    3
4  形态                                                         世界苹果产量  0.4036402    4
5  形态 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不  0.3262694    5
6  形态                                                          蘋果有超过7  0.2895889    6
7  形态                                                         开花的苹果树  0.2073026    7
8  形态                                                               、Gold  0.1607383    8

> summary(w2v)
 [1] "500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red"        
 [2] "世界苹果产量"                                                            
 [3] "喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水\xe8"  
 [4] "</s>"                                                                    
 [5] "。英國北威爾斯巴德西島(Bardsey"                                         
 [6] "1"                                                                       
 [7] "Delicious等"                                                             
 [8] "开花的苹果树"                                                            
 [9] "蘋果有超过7"                                                             
[10] "、Gold"                                                                  
[11] "落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不"    
[12] "编辑"                                                                    
[13] "Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會\xe6"
[14] "习性"                                                                    
[15] "形态"                                                                    
[16] "品种"                                                                    
[17] "Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果"      
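
The vocabulary above hints at why single words are hard to look up: most entries are long chunks rather than words, which suggests the text was split mainly on spaces and ASCII punctuation, while Chinese is written without spaces between words. Below is a minimal base-R sketch of that effect on a short sample from the text above. The per-character split is only a naive stand-in (an assumption for illustration) for a proper Chinese word segmenter.

```r
# Short sample taken from the example text above
x <- "形态 开花的苹果树 落叶乔木,树高可达15米"

# Splitting on spaces yields a few long chunks, not individual words
chunks <- strsplit(x, " ", fixed = TRUE)[[1]]
print(chunks)

# Naive alternative (assumption): treat every character as one token
chars <- strsplit(gsub(" ", "", x, fixed = TRUE), "")[[1]]
print(chars)

# A character that occurs inside a chunk is not itself a space-delimited token
print("苹" %in% chunks)
print("苹" %in% chars)
```

Whatever tokeniser is used, a term can only be looked up with predict() if it appears as a separate token in the vocabulary.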

jwijffels avatar Sep 09 '21 15:09 jwijffels

library(word2vec)
x <- readLines("example.txt", encoding = "UTF-8")
cat(x)
w2v <- word2vec::word2vec(x, min_count = 0)
predict(w2v, newdata = "形态", type = "nearest")

The result is below:

predict(w2v, newdata = "形态", type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: 形态

How can I solve this problem? My system is Windows. Thank you!

niutyut avatar Sep 09 '21 23:09 niutyut

Write your cleaned text to a file and run word2vec on that file (test.txt in the example below) instead of passing a character vector.

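library(readr)
library(word2vec)
## x is the character vector read earlier with readLines("example.txt", encoding = "UTF-8")
## ascii = FALSE and alpha = FALSE keep the non-ASCII, non-alphabetic characters of the Chinese text
x <- txt_clean_word2vec(x, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
## write the cleaned text to disk; readr's write_lines() writes UTF-8
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 0) ## adjust the hyperparameters to suit your own data
terminology <- summary(model)
## look up two randomly chosen terms from the vocabulary
example <- sample(terminology, size = 2)
example
predict(model, newdata = example, type = "nearest")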

jwijffels avatar Sep 10 '21 07:09 jwijffels

Version 0.4.0 of the R package may solve this issue. It allows building a word2vec model from a list of tokenised sentences. In addition, the file-based approach (word2vec.character), which writes the text data to a file before training, now uses useBytes = TRUE when writing. So it seems to me you can choose either of the two options.
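
As a rough illustration of the first option, here is a minimal sketch that passes a list of tokenised sentences to word2vec(), assuming package version 0.4.0 or later. The three sample sentences are taken from the example text earlier in this thread. Splitting each sentence into single characters is only a placeholder tokeniser (an assumption for illustration), a real Chinese word segmenter would normally be used instead, and the hyperparameters are illustrative rather than recommended values.

```r
library(word2vec)  # assumes version >= 0.4.0, which accepts a list of tokens

# A few sentences from the example text used earlier in this thread
x <- c("落叶乔木,树高可达15米",
       "果实为仁果,颜色及大小因品种而异",
       "喜光,喜微酸性到中性土壤")

# Placeholder tokeniser (assumption): one token per character
tokens <- strsplit(x, "")

# Train directly on the list of tokens; no intermediate file is written
model <- word2vec(x = tokens, type = "cbow", dim = 15, iter = 20, min_count = 1)

# Check that a term is in the vocabulary before asking for its neighbours
vocab <- summary(model, type = "vocabulary")
print("果" %in% vocab)
predict(model, newdata = "果", type = "nearest", top_n = 5)
```

With so little text the similarities are not meaningful. The point is only that terms handed over as separate tokens should end up as separate vocabulary entries.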

Closing, feel free to re-open if needed.

jwijffels avatar Oct 05 '23 14:10 jwijffels