byronyuan
Quick question: your code runs and handles Chinese chunks directly, but in the snippet below, why is it `[source_doc_idx] * len(chunk)` rather than just a single `[source_doc_idx]`? In the generated `create_base_text_units.csv`, the `document_ids` column for each chunk row contains `n_tokens` entries, all duplicates. What is the point of that?

```python
for source_doc_idx, text in mapped_ids:
    chunks = text_splitter.split_text(text)
    for chunk in chunks:
        result.append(
            TextChunk(
                text_chunk=chunk,
                source_doc_indices=[source_doc_idx] * len(chunk),
                n_tokens=len(chunk),
            )
        )
```
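To illustrate the behavior being asked about: multiplying a one-element list by `len(chunk)` repeats the source document index once per character of the chunk, which is why every chunk row ends up with `n_tokens` identical ids. A minimal sketch (the strings and index here are made up for illustration, not actual graphrag data):

```python
# A short Chinese chunk: len() counts characters, so for a 6-character
# chunk the index list below has 6 identical entries.
chunk = "这是一个测试"
source_doc_idx = 0

# This mirrors the expression from the snippet above.
source_doc_indices = [source_doc_idx] * len(chunk)

print(source_doc_indices)        # → [0, 0, 0, 0, 0, 0]
print(len(source_doc_indices))   # → 6, same as n_tokens = len(chunk)
```

So the repetition in `document_ids` is a direct consequence of this expression: the list length is tied to the chunk length rather than to the number of distinct source documents.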
> Hi! We just released 0.3.0 with a fix to address unicode characters. Can you please try with that version?

The problem remains after updating to version 0.3.0 in...
The output of `nx.generate_graphml` is encoded with HTML entities, so it displays abnormally when rendered directly; it needs to be decoded with `html.unescape()`.
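A minimal sketch of the workaround described above, using the standard library's `html.unescape`. The sample string is illustrative, not actual graphrag output; it just mimics the kind of HTML-entity escaping seen in the GraphML text:

```python
import html

# GraphML text where quotes and ampersands were escaped as HTML
# entities, which renders oddly when displayed as-is.
escaped = "&quot;中文&quot; &amp; more"

# html.unescape converts the entities back to their literal characters.
decoded = html.unescape(escaped)

print(decoded)  # → "中文" & more
```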