ragflow
[Bug]: rag_tokenizer recurses infinitely in dfs_()
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
d447392
RAGFlow image version
0.17.1
Other environment information
Actual behavior
When the input contains Chinese text such as "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二", the function dfs_() goes into infinite recursion, and memory usage keeps growing.
Expected behavior
No response
Steps to reproduce
Create a txt file containing: 一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二
Upload it to RAGFlow and parse it.
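The file from the first step can also be generated with a short script, which avoids copy-and-paste or encoding mistakes. This is only a convenience sketch: the file name `repro_dfs.txt` and the helper `write_repro_file` are made up for illustration, and UTF-8 encoding is an assumption.

```python
from pathlib import Path

# Input string from this report; it consists only of the characters 一, 十 and 二.
REPRO_TEXT = (
    "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一"
    "十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二"
)


def write_repro_file(path: str = "repro_dfs.txt") -> Path:
    """Write the reproduction text to a UTF-8 file and return its path."""
    target = Path(path)
    target.write_text(REPRO_TEXT, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"wrote {write_repro_file()}")
```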
Additional information
No response
In rag_tokenizer.py:
if __name__ == '__main__':
    tknzr = RagTokenizer(debug=True)
    # huqie.addUserDict("/tmp/tmp.new.tks.dict")
    tks = tknzr.tokenize(
        "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二")
    logging.info(tknzr.fine_grained_tokenize(tks))
    tks = tknzr.tokenize(
        "哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈")
    logging.info(tknzr.fine_grained_tokenize(tks))
Running this leads to infinite recursive calls, and memory usage keeps growing.
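One plausible explanation, not confirmed against the RAGFlow source, is that a depth-first search over all possible segmentations explodes combinatorially on highly repetitive input, which can look like infinite recursion. The sketch below is not the code of dfs_(): `VOCAB`, `MAX_WORD_LEN`, `count_naive` and `count_memoized` are hypothetical names, and the toy vocabulary only assumes that both single characters and their two-character combinations are dictionary words. It compares an exhaustive DFS with a memoized one that visits each start position once.

```python
from functools import lru_cache

# Hypothetical toy dictionary: every single character and several pairs are "words".
VOCAB = {"一", "十", "二", "一一", "一十", "十一", "十十", "十二", "二十", "二二"}
MAX_WORD_LEN = 2


def count_naive(text: str) -> tuple[int, int]:
    """Exhaustive DFS. Returns (number of segmentations, number of calls)."""
    calls = 0

    def dfs(start: int) -> int:
        nonlocal calls
        calls += 1
        if start == len(text):
            return 1  # reached the end: one complete segmentation
        total = 0
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            if text[start:end] in VOCAB:
                total += dfs(end)  # the same suffix is re-explored many times
        return total

    return dfs(0), calls


def count_memoized(text: str) -> tuple[int, int]:
    """Same search, but each start position is solved only once."""
    calls = 0

    @lru_cache(maxsize=None)
    def dfs(start: int) -> int:
        nonlocal calls
        calls += 1
        if start == len(text):
            return 1
        return sum(
            dfs(end)
            for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1)
            if text[start:end] in VOCAB
        )

    return dfs(0), calls


if __name__ == "__main__":
    for n in (10, 20, 25):
        sample = "一" * n
        print(n, count_naive(sample), count_memoized(sample))
```

In this toy model the naive call count grows exponentially with input length, while the memoized version needs one call per start position. Whether dfs_() fails for this reason or because of a genuine non-terminating path would need to be checked in rag_tokenizer.py; a depth or candidate limit would be an alternative mitigation.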