zhanghy-alt comments

Results 4 comments of


zhanghy-alt

The entities extracted from Chinese manual documents are very messy

Code change to avoid the garbled output produced when splitting tokens with cl100k_base. Thanks to Nuclear6 for the idea. ``` # Copyright (c) 2024 Microsoft Corporation. # Licensed under the MIT License """A module containing run and split_text_on_tokens methods definition.""" import logging import re from typing import...

The entities extracted from Chinese manual documents are very messy

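> A question: your code runs as-is and handles Chinese chunks. In the code below, why is it `[source_doc_idx] * len(chunk)` rather than just a single `[source_doc_idx]`? In the create_base_text_units.csv table that this code generates, the document_ids column of every chunk holds n_tokens entries, all of them duplicates. What is the purpose of that?
>
>     for source_doc_idx, text in mapped_ids:
>         chunks = text_splitter.split_text(text)
>         for chunk in chunks:
>             result.append(
>                 TextChunk(
>                     text_chunk=chunk,
>                     source_doc_indices=[source_doc_idx] * len(chunk),
>                     n_tokens=len(chunk),
>                 )
>             )

That code has no particular meaning; it is only there to match the input graphrag expects.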
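To make the repeated entries concrete, here is a minimal sketch of what the loop produces. The `TextChunk` dataclass, `split_fixed`, and `unique_doc_indices` below are hypothetical stand-ins written for this illustration, not graphrag's own classes or API. The sketch assumes `split_text` returns plain strings, in which case `len(chunk)` counts characters rather than tokens.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical stand-in with the three fields used in the loop above."""
    text_chunk: str
    source_doc_indices: list[int]
    n_tokens: int


def split_fixed(text: str, size: int = 4) -> list[str]:
    """Toy splitter (assumption): cut the text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_chunks(mapped_ids, split_text) -> list[TextChunk]:
    """Same shape as the loop quoted above, with the splitter passed in."""
    result = []
    for source_doc_idx, text in mapped_ids:
        for chunk in split_text(text):
            result.append(
                TextChunk(
                    text_chunk=chunk,
                    # One copy of the document index per character of the chunk.
                    source_doc_indices=[source_doc_idx] * len(chunk),
                    n_tokens=len(chunk),
                )
            )
    return result


def unique_doc_indices(chunk: TextChunk) -> list[int]:
    """Collapse the repeated indices to the distinct source documents."""
    return sorted(set(chunk.source_doc_indices))


chunks = build_chunks([(0, "知识图谱检索增强")], split_fixed)
for c in chunks:
    print(c.text_chunk, c.source_doc_indices, unique_doc_indices(c))
```

Each chunk carries the same document index once per character, which is where the duplicates in document_ids come from; de-duplicating as in `unique_doc_indices` loses no information in this sketch.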