jiebaR
Two-step POS tagging of text files fails in file mode
Dear Dr. Qin, I have run into trouble with POS tagging and would appreciate your help. The details are as follows:

1. Environment information

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] jiebaR_0.9.99 jiebaRD_0.1 chinese.misc_0.1.9
loaded via a namespace (and not attached):
[1] compiler_3.5.1 magrittr_1.5 parallel_3.5.1 tools_3.5.1 NLP_0.1-11
[6] yaml_2.2.0 Rcpp_0.12.18 slam_0.1-43 xml2_1.2.0 stringi_1.1.7
[11] tm_0.7-5 Ruchardet_0.0-3 rlang_0.2.2 purrr_0.2.5
2. Full error messages
Warning messages:
1: In segment(itext, analyzer, mod = "mix") :
In file mode, only the first element will be processed.
2: In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on 'E:/201803D/0910ontosim/texttest/鍩轰簬鏂囩尞璁¢噺瀛︾殑鍥介檯鐏北鐢熸€佸鐮旂┒鎬佸娍鍒嗘瀽_榄忔檽闆?segment.2018-09-15_23_44_30.txt'
Error in file_coding(code[1]) : Cannot open file

3. Minimal reproducible code and data source files; which step of the code errors
# Text processing and analysis
ifolder <- "E:/201803D/0910ontosim/texttest"
itext <- list.files(ifolder, pattern = ".txt", all.files = FALSE,
                    recursive = TRUE, include.dirs = FALSE, full.names = TRUE)

# Tagging
library(jiebaR)
analyzer <- worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
                   user = "E:/2017DN/data/custom.dict",
                   stop_word = "E:/2017DN/data/stopwords.txt",
                   write = TRUE, qmax = 20, topn = 5, encoding = "UTF-8",
                   detect = TRUE, symbol = FALSE, lines = 1e+05,
                   output = NULL, bylines = TRUE, user_weight = "max")
textseg <- segment(itext, analyzer, mod = "mix")
tokenizer <- worker("tag")
pos_tag <- tagging(textseg, tokenizer)
4. What I have tried, and possible root causes. I tested passing strings directly to the segmentation and tagging objects: that runs without error. I also tested one-step POS tagging, which works fine (though I do not know whether a third-party dictionary can be used there; tagging professional documents relies heavily on specialized vocabulary). But when I switch back to text-file input, the tagging step after segmentation fails: it still claims it cannot read the file, and the second, tagged segmentation file is never generated.
@Hz-EMW
1. Make sure the file paths contain no Chinese characters (you can also use normalizePath(fs::dir_ls("E:/201803D/0910ontosim/texttest", glob = "*.txt"))).
2. Make sure the files are encoded as UTF-8, or specify the encoding explicitly when reading them.
3. Specialized vocabulary can be added to the user-defined dictionary.
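Putting the three suggestions together, here is a minimal sketch of a per-file workaround. It avoids file mode entirely: each file is read with an explicit encoding and tagged in string mode, which also sidesteps the "only the first element will be processed" warning, since segment()/tagging() in file mode handle a single path per call. The folder path, dictionary path, and output naming are assumptions for illustration, not the package's required layout:

```r
library(jiebaR)

# Hypothetical ASCII-only folder: non-ASCII (Chinese) path components are
# what trigger "Cannot open file" on Windows, so copy/rename files first.
ifolder <- "E:/corpus/texttest"
itext <- list.files(ifolder, pattern = "\\.txt$", full.names = TRUE)

# One-step segmentation + POS tagging; the user dictionary is where
# domain-specific vocabulary can be added.
tagger <- worker(type = "tag",
                 user = "E:/2017DN/data/custom.dict",
                 encoding = "UTF-8",
                 symbol = FALSE)

for (f in itext) {
  # Read with an explicit encoding instead of letting file mode guess it.
  txt <- readLines(f, encoding = "UTF-8", warn = FALSE)
  txt <- txt[nzchar(txt)]                     # drop empty lines
  tagged <- lapply(txt, function(x) tagging(x, tagger))
  saveRDS(tagged, sub("\\.txt$", ".tag.rds", f))
}
```

Tagging in string mode means the results stay in R (a named character vector per line, names being the POS tags), so you control the output files yourself rather than relying on jiebaR's automatic "*.segment.*" file naming.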