jiebaR
Two-step POS tagging of text files fails in file mode
Dear Dr. Qin, I have run into trouble with POS tagging and would appreciate your help. The details are as follows:

1. Environment information

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] jiebaR_0.9.99 jiebaRD_0.1 chinese.misc_0.1.9
loaded via a namespace (and not attached):
[1] compiler_3.5.1 magrittr_1.5 parallel_3.5.1 tools_3.5.1 NLP_0.1-11
[6] yaml_2.2.0 Rcpp_0.12.18 slam_0.1-43 xml2_1.2.0 stringi_1.1.7
[11] tm_0.7-5 Ruchardet_0.0-3 rlang_0.2.2 purrr_0.2.5
2. Full error messages
Warning messages:
1: In segment(itext, analyzer, mod = "mix") :
In file mode, only the first element will be processed.
2: In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on 'E:/201803D/0910ontosim/texttest/鍩轰簬鏂囩尞璁¢噺瀛︾殑鍥介檯鐏北鐢熸€佸鐮旂┒鎬佸娍鍒嗘瀽_榄忔檽闆?segment.2018-09-15_23_44_30.txt'
Error in file_coding(code[1]) : Cannot open file

3. Minimal reproducible code and data source files; which step of the code errors
# Text processing and analysis
ifolder <- "E:/201803D/0910ontosim/texttest"
itext <- list.files(ifolder, pattern = ".txt", all.files = FALSE,
                    recursive = TRUE, include.dirs = FALSE, full.names = TRUE)

# Tagging
library(jiebaR)
analyzer <- worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
                   user = "E:/2017DN/data/custom.dict",
                   stop_word = "E:/2017DN/data/stopwords.txt",
                   write = TRUE, qmax = 20, topn = 5, encoding = "UTF-8",
                   detect = TRUE, symbol = FALSE, lines = 1e+05,
                   output = NULL, bylines = TRUE, user_weight = "max")
textseg <- segment(itext, analyzer, mod = "mix")
tokenizer <- worker("tag")
pos_tag <- tagging(textseg, tokenizer)
4. What I have tried, and possible root causes. I tested passing strings directly to the segmentation and tagging objects: that runs without error. I also tested one-step POS tagging, which works fine (though I do not know whether a third-party dictionary can be used there; tagging professional documents relies heavily on specialized vocabulary). But when I switch back to text-file input, the tagging step after segmentation fails: it still claims it cannot read the file, and the second, tagged segmentation file is never generated.
@Hz-EMW
1. Make sure the file paths contain no Chinese characters (you can also use normalizePath(fs::dir_ls("E:/201803D/0910ontosim/texttest", glob = "*.txt"))).
2. Make sure the files are encoded as UTF-8, or specify the encoding explicitly when reading them.
3. Specialized vocabulary can be added to the user-defined dictionary.
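Putting the three suggestions together, here is a minimal sketch of a per-file workaround. It avoids file mode entirely: each file is read with an explicit encoding and tagged in string mode, which also sidesteps the "only the first element will be processed" warning, since segment()/tagging() in file mode handle a single path per call. The folder path, dictionary path, and output naming are assumptions for illustration, not the package's required layout:

```r
library(jiebaR)

# Hypothetical ASCII-only folder: non-ASCII (Chinese) path components are
# what trigger "Cannot open file" on Windows, so copy/rename files first.
ifolder <- "E:/corpus/texttest"
itext <- list.files(ifolder, pattern = "\\.txt$", full.names = TRUE)

# One-step segmentation + POS tagging; the user dictionary is where
# domain-specific vocabulary can be added.
tagger <- worker(type = "tag",
                 user = "E:/2017DN/data/custom.dict",
                 encoding = "UTF-8",
                 symbol = FALSE)

for (f in itext) {
  # Read with an explicit encoding instead of letting file mode guess it.
  txt <- readLines(f, encoding = "UTF-8", warn = FALSE)
  txt <- txt[nzchar(txt)]                     # drop empty lines
  tagged <- lapply(txt, function(x) tagging(x, tagger))
  saveRDS(tagged, sub("\\.txt$", ".tag.rds", f))
}
```

Tagging in string mode means the results stay in R (a named character vector per line, names being the POS tags), so you control the output files yourself rather than relying on jiebaR's automatic "*.segment.*" file naming.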