jieba
How to change the decoder
I am currently trying to use Jieba in combination with Learning with Texts. What I want is for Jieba to put a space between each "word" when run from cmd. For example, 我想飞去北京 would be broken down into 我, 想, 飞, 去, 北京. What I tried initially was python -m jieba -d ' ' input.txt >output.txt, but it would just keep showing "Prefix dict has been built successfully". I then tried python -m jieba -a file1 > file2 and got the error below:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 1.173 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\xilab\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\xilab\lib\site-packages\jieba_main.py", line 52, in
What do you guys think? Sorry for the poor formatting; this is my first post.
Based on your description, it seems your input file is not encoded in UTF-8, which prevents Jieba from decoding and segmenting it properly. I would recommend converting your input.txt file to UTF-8. As mentioned in Jieba's readme, "The input string can be an unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectly decoded as UTF-8." So UTF-8 is preferred.
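If it helps, here is a minimal sketch of one way to do that conversion in Python. The function name convert_to_utf8 and the gb18030 fallback are my own assumptions, not anything from Jieba; the fallback only makes sense if the file really is in a GBK-family encoding. Decoding with "utf-8-sig" also drops a leading BOM if the file has one.

```python
from pathlib import Path


def convert_to_utf8(src, dst, fallback="gb18030"):
    """Rewrite src as UTF-8 without a BOM and return the decoded text.

    Hypothetical helper: the name and the fallback encoding are assumptions.
    """
    raw = Path(src).read_bytes()
    try:
        # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is in a GBK-family encoding.
        text = raw.decode(fallback)
    # Write bytes directly so line endings are left exactly as they were.
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

You could call it as convert_to_utf8("input.txt", "input_utf8.txt") and then point Jieba at the new file.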
Hope this helps you resolve the issue with using Jieba. Let me know if you have any other questions!
Hey AlexanderMisel, it turns out I've still got problems with it.
Initially, I tried putting the text in a Word .docx so I could choose the encoding, but I got the same problem. In the .docx I selected UTF-8, and the .txt says it's UTF-8 BOM. Unfortunately, I've still got the same problem.
C:\Users\xilab>python -m jieba -d'' "C:\Users\xilab\Desktop\g.txt" > "C:\Users\xilab\Desktop\eb.txt"
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\xilab\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\xilab\lib\site-packages\jieba_main.py", line 52, in
Any thoughts? Thank you.
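The traceback above is cut off, so the actual cause is not confirmed. One thing that might be worth trying is to skip the command-line module and call Jieba from a short script that states the file encodings explicitly, since Python's open() without an encoding argument falls back to the system locale's encoding. The sketch below is only an illustration under that assumption: segment_file and its cut parameter are hypothetical names, and jieba.cut is the only Jieba call used. Reading with "utf-8-sig" handles the "UTF-8 BOM" file as well as plain UTF-8.

```python
def segment_file(src, dst, delimiter=" ", cut=None):
    """Write each line of src to dst with delimiter between segmented words.

    Hypothetical helper. cut is any callable that takes a string and returns
    an iterable of words; it defaults to jieba.cut.
    """
    if cut is None:
        import jieba  # third-party package, imported only when needed
        cut = jieba.cut
    # State both encodings explicitly instead of relying on the system locale.
    with open(src, encoding="utf-8-sig") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            words = cut(line.rstrip("\r\n"))
            fout.write(delimiter.join(words) + "\n")
```

Something like segment_file(r"C:\Users\xilab\Desktop\g.txt", r"C:\Users\xilab\Desktop\eb.txt") would then write the space-separated text without going through the console's output encoding.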