[Bug]: rag_tokenizer has infinite recursion in dfs_()

Open • trexliu opened this issue 9 months ago • 1 comment

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

d447392

RAGFlow image version

0.17.1

Other environment information


Actual behavior

When the input contains Chinese text such as "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二", the function dfs_() goes into infinite recursion, and memory usage keeps growing.

Expected behavior

No response

Steps to reproduce

Create a txt file containing: 一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二
Upload it to RAGFlow and parse it.
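
The file from the first step can be produced with a short script. This is only a convenience sketch: the file name `repro.txt` and the helper `write_repro_file` are made up for illustration, and the upload and parse step still has to be done through RAGFlow itself.

```python
from pathlib import Path
import tempfile

# The exact input string from this report.
REPRO_TEXT = (
    "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一"
    "十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二"
)


def write_repro_file(directory: str) -> Path:
    """Write the reproduction text as UTF-8 to <directory>/repro.txt (hypothetical name)."""
    path = Path(directory) / "repro.txt"
    path.write_text(REPRO_TEXT, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Prints the location of the generated file.
    print(write_repro_file(tempfile.gettempdir()))
```

Writing the file explicitly as UTF-8 avoids the platform default encoding altering the characters before they reach the tokenizer.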

Additional information

No response

trexliu, Mar 14 '25 06:03

In rag_tokenizer.py

if __name__ == '__main__':
    tknzr = RagTokenizer(debug=True)
    # huqie.addUserDict("/tmp/tmp.new.tks.dict")
    # Input from the report: long runs of repeated Chinese numerals.
    tks = tknzr.tokenize(
        "一一一一一一一一一一一一一一一十一十一十一十一十一十一十一十一十一一一一一一十十十十十十十二十二十二十二十二十二十二十二二十二十二十二十二")
    logging.info(tknzr.fine_grained_tokenize(tks))
    # Second input: a single character repeated many times.
    tks = tknzr.tokenize(
        "哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈")
    logging.info(tknzr.fine_grained_tokenize(tks))

This leads to infinite recursive calls, and memory usage keeps growing.
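
The source of dfs_() is not shown in this thread, so the following is only a toy model, not RAGFlow's code. It sketches one plausible way a depth-first search over dictionary segmentations can blow up on repeated characters: when every short run of the same character is a valid word, the number of paths grows exponentially with input length. That can look like recursion that never ends. The dictionary `TOY_DICT` and the functions `count_naive` and `count_memo` are assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical toy dictionary; NOT RAGFlow's real vocabulary.
TOY_DICT = {"一", "一一", "一一一"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)


def count_naive(text: str, start: int = 0, calls: list = None) -> int:
    """Count all segmentations by plain DFS; calls[0] tallies recursive calls."""
    if calls is not None:
        calls[0] += 1
    if start == len(text):
        return 1  # reached the end: one complete segmentation
    total = 0
    last = min(len(text), start + MAX_WORD_LEN)
    for end in range(start + 1, last + 1):
        if text[start:end] in TOY_DICT:
            total += count_naive(text, end, calls)
    return total


def count_memo(text: str) -> int:
    """Same count, but each start position is solved only once."""
    @lru_cache(maxsize=None)
    def go(start: int) -> int:
        if start == len(text):
            return 1
        last = min(len(text), start + MAX_WORD_LEN)
        return sum(
            go(end)
            for end in range(start + 1, last + 1)
            if text[start:end] in TOY_DICT
        )

    return go(0)


if __name__ == "__main__":
    sample = "一" * 15
    calls = [0]
    print("segmentations:", count_naive(sample, 0, calls))
    print("naive recursive calls:", calls[0])
    # The memoized version handles much longer input with one call per position.
    print("segmentations for 76 chars:", count_memo("一" * 76))
```

In this toy model the naive search does work proportional to the number of segmentations, while the memoized one does work proportional to the input length. Whether memoization, a depth limit, or something else is the right fix for dfs_() depends on its actual implementation.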

781574155, Mar 14 '25 07:03