word2vec
not chinese
file_in contains the text shown in this picture.
model <- word2vec(x = file_in, type = "cbow", dim = 15, iter = 20)
lookslike <- predict(model, c( "鹰"), type = "nearest", top_n = 5)
lookslike
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: 鹰
But 鹰 is in the text shown in the picture. Can you provide an example in Chinese?
I ran this on a Linux box with the attached file, example.txt. I'm not sure whether the output makes sense, as I don't speak Chinese.
> x <- readLines("example.txt", encoding = "UTF-8")
> cat(x)
形态 开花的苹果树 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不规则的纵裂或片状剥落,小枝光滑。叶序为单叶互生,椭圆至卵圆形,叶缘有锯齿。伞房花序,花瓣白色,含苞时带粉红色,雄蕊20,花柱5,大多数品种自花不育,需种植授粉树。果实为仁果,颜色及大小因品种而异。蘋果膳食纖維含量很豐富﹐也含有大量的果膠﹐對於整腸及調整腸道菌叢生態大有幫助。是一種綠色水果 [编辑] 习性 喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水良好的沙质土壤。 [编辑] 品种 世界苹果产量 蘋果有超过7,500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果)、Gold Delicious等[1]。英國北威爾斯巴德西島(Bardsey Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會患病,被媒體稱為「世界上最罕有的蘋果」。除鮮食的品種外,尚有烹調用的蘋果。由於蘋果的果酸有保持水份的作用,適宜烤焗。
> w2v <- word2vec::word2vec(x, min_count = 0)
> predict(w2v, newdata = "形态", type = "nearest")
$形态
term1 term2 similarity rank
1 形态 1 0.5622963 1
2 形态 。英國北威爾斯巴德西島(Bardsey 0.5039170 2
3 形态 习性 0.4285294 3
4 形态 世界苹果产量 0.4036402 4
5 形态 落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不 0.3262694 5
6 形态 蘋果有超过7 0.2895889 6
7 形态 开花的苹果树 0.2073026 7
8 形态 、Gold 0.1607383 8
> summary(w2v)
[1] "500个已知品种。良种有红星系列、紅富士、乔纳森等。美國的名種有Red"
[2] "世界苹果产量"
[3] "喜光,喜微酸性到中性土壤。最适于土层深厚,富含有机质,心土通气排水\xe8"
[4] "</s>"
[5] "。英國北威爾斯巴德西島(Bardsey"
[6] "1"
[7] "Delicious等"
[8] "开花的苹果树"
[9] "蘋果有超过7"
[10] "、Gold"
[11] "落叶乔木,树高可达15米,栽培条件下一般高3~5米。树干灰褐色,老皮有不"
[12] "编辑"
[13] "Island)則在近年發現新品種,比普通的果樹更健康,除了蟲害以外,並不會\xe6"
[14] "习性"
[15] "形态"
[16] "品种"
[17] "Delicious(香港稱地利蛇果,簡稱蛇果;台灣稱五爪蘋果"
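The vocabulary above suggests that the text is split only on whitespace and some ASCII delimiters, so whole unsegmented Chinese clauses end up as single "words". Here is a minimal base-R sketch of that effect, using a made-up sentence (not taken from example.txt) and a crude per-character split as a stand-in for a real Chinese word segmenter:

```r
# Hypothetical sample text: two clauses separated by a single space
sentence <- "老鹰是一种猛禽 鹰的视力很好"

# Whitespace splitting keeps each unsegmented clause as one token
by_space <- strsplit(sentence, "[[:space:]]+")[[1]]

# Crude fallback: drop whitespace, then split into single characters
by_char <- strsplit(gsub("[[:space:]]+", "", sentence), "")[[1]]

# Look the single character up in both token sets
in_space_tokens <- "鹰" %in% by_space
in_char_tokens <- "鹰" %in% by_char
print(c(whitespace = in_space_tokens, character = in_char_tokens))
```

If a term never appears as a separate token in the training text, it cannot be in the model's dictionary, which would explain a "Could not find the word in the dictionary" error even though the character is visible in the text.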
library(word2vec)
x <- readLines("example.txt", encoding = "UTF-8")
cat(x)
w2v <- word2vec::word2vec(x, min_count = 0)
predict(w2v, newdata = "形态", type = "nearest")
The result is below:
predict(w2v, newdata = "形态", type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: 形态
How can I solve this problem? My system is Windows. Thank you!
Write your cleaned text to a file and run word2vec on that file (test.txt in the example below) instead of passing a character vector:
library(readr)
library(word2vec)
x <- txt_clean_word2vec(x, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 0) ## adjust the hyperparameters to suit your own data
terminology <- summary(model)
example <- sample(terminology, size = 2)
example
predict(model, newdata = example, type = "nearest")
R package version 0.4.0 may solve this issue. It allows you to build a word2vec model from a list of tokenised sentences. In addition, the file-based approach used when you pass a character vector (word2vec.character) now writes the text to file with useBytes = TRUE before training. So it seems to me you can choose either of the two options.
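As a rough illustration of the first option, the sketch below passes a list of tokenised sentences to word2vec. The toy corpus is hypothetical and already space-separated; real Chinese text would first need a word segmenter, which is not shown here. The hyperparameters are placeholders, and the training step only runs if the word2vec package (assumed to be version 0.4.0 or later) is installed.

```r
# Hypothetical toy corpus, already segmented with spaces between words
sentences <- c("老鹰 是 一种 猛禽", "鹰 的 视力 很好", "鹰 吃 小 动物")

# One character vector of tokens per sentence
tokens <- strsplit(sentences, " ", fixed = TRUE)

if (requireNamespace("word2vec", quietly = TRUE)) {
  # Assumption: word2vec >= 0.4.0, which accepts a list of token vectors
  model <- word2vec::word2vec(x = tokens, type = "cbow", dim = 15,
                              iter = 20, min_count = 1)
  # Check whether the term made it into the model's dictionary
  print("鹰" %in% summary(model))
}
```

Because the tokens are handed over directly, the text does not go through the package's own splitting of a character vector or file, so what counts as a word is decided entirely by your own tokenisation.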
Closing, feel free to re-open if needed.