Nuclear6 comments

Results: 40 comments of Nuclear6

The entities extracted from Chinese manual documents are very messy

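I have 8 manual documents, 130 KB in total, and the results only became somewhat better after the following optimizations.

Index building phase (the model is Doubao 128k; one run costs about 10 yuan):
1. Embedding service: I deployed the open-source bge-large-zh model myself, served through oneapi.
2. Chunking: I reworked the logic based on Langchain-Chatchat, to avoid the garbled characters that appear when text is split into tokens with cl100k_base.
3. Entity types: I redefined them. I gave an excerpt of the documents to 4o and asked it to summarize which entity types need to be defined.
4. Prompt: I changed it to Chinese and removed the examples unrelated to the manual documents; the 4o model can generate suitable replacement examples.

Query phase: the retrieved entities were very different from the query. The cause was that, with a custom embedding service, the cl100k_base-related operations have to be removed. After that change the results improved.

This is my experience optimizing for Chinese electronic manuals, shared for everyone's reference!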

The entities extracted from Chinese manual documents are very messy

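@KylinMountain
1. The official chunking first tokenizes the document and then splits by token count, which easily produces garbled characters for Chinese. The open-source Langchain-Chatchat project splits by Chinese character count instead, which effectively avoids garbled chunks.
Official chunk: https://github.com/microsoft/graphrag/blob/main/graphrag/index/verbs/text/chunk/strategies/tokens.py
Reference chunk: https://github.com/chatchat-space/Langchain-Chatchat/blob/master/libs/chatchat-server/chatchat/server/file_rag/text_splitter/chinese_recursive_text_splitter.py
2. I don't think chunking has much to do with the model. The Chinese-oriented chunking logic keeps sentences intact, so the model may understand the chunks better.
3. I did not use the official prompt tuning. Since you said it tends to throw errors, I simply had 4o translate the templates side by side to produce the corresponding Chinese ones.
4. As I understand it, one document versus multiple documents makes little difference. Entities are extracted per chunk, and embeddings are then built from the entities and their descriptions; I did not see the document name play much of a role.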
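To illustrate point 1, here is a minimal sketch of character-based chunking that only cuts at sentence-ending punctuation. It is not the Langchain-Chatchat implementation linked above; the function name `split_chinese_text`, the punctuation set, and the default `chunk_size` are assumptions made for this example.

```python
import re

# Zero-width split point right after sentence-ending punctuation.
# The punctuation set is an assumption; extend it for your documents.
_SENTENCE_END = re.compile(r"(?<=[。！？；!?;\n])")


def split_chinese_text(text: str, chunk_size: int = 250) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters."""
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than one chunk is hard-split by characters.
        while len(sentence) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        # Start a new chunk when the next sentence would not fit.
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    manual = "请先关闭电源。然后打开后盖！最后更换电池。"
    for chunk in split_chinese_text(manual, chunk_size=14):
        print(chunk)
```

Because the length is counted in characters and cuts fall on sentence boundaries, a chunk never ends in the middle of a multi-byte character, which is the failure mode described for token-based splitting.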

The entities extracted from Chinese manual documents are very messy

@dinhngoc267 I recommend giving a sample of your input documents to the gpt-4o model and letting it help you define the entity types.

[Bug]: Data initialization is very slow

How much concurrency does your LLM model service support?

[Bug]: Data initialization is very slow

How do you split your chunks? Or do you split them by tokens with OpenAI's tiktoken?

[Bug]: Data initialization is very slow

Since you are using the official chunking logic, you can follow @KylinMountain's suggestion and try increasing the chunk size. It is best to combine the logs to carefully analyze...
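As a rough illustration of why a larger chunk size can shorten indexing, the sketch below counts how many chunks a sliding token window produces. It assumes one extraction pass per chunk and a window that advances by `chunk_size - overlap` tokens; the function name `count_chunks` and the numbers in the usage loop are hypothetical and are not taken from any log in this thread.

```python
import math


def count_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of windows when each window starts (chunk_size - overlap) tokens after the last."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if total_tokens <= 0:
        return 0
    step = chunk_size - overlap
    # A new window starts at 0, step, 2 * step, ... while the start is inside the text.
    return math.ceil(total_tokens / step)


if __name__ == "__main__":
    total = 100_000  # hypothetical corpus size in tokens
    for size in (300, 600, 1200):
        print(size, count_chunks(total, chunk_size=size, overlap=100))
```

Fewer chunks means fewer requests to the model, so with limited concurrency on the LLM service the chunk count is one of the first things worth checking against the logs.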

How to run on multiple GPUs when a local deployment runs out of GPU memory?

That parameter is probably the cause. Running
lmdeploy serve api_server /work/internlm2_5-7b-chat --server-port 8089 --tp 1
fails with:
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32
Then I ran
lmdeploy serve api_server /work/internlm2_5-7b-chat --server-port 8089 --tp 4
The tp setting where I import the package also looks fine to me. I will check again.

Nuclear6

The entities extracted from Chinese manual documents are very messy

[Bug]: Data initialization is very slow

How to run on multiple GPUs when a local deployment runs out of GPU memory?

Does process scheduling easily become chaotic in a concurrent environment?

create_final_entities ERROR