
[Bug]: Indexing the data is very slow

Open fucktx opened this issue 1 year ago • 14 comments

python -m graphrag.index --root ./ragtest

I have a CSV file with over 2,000 news articles, and indexing has been running for more than ten hours. Is there another way to initialize the data? For example, could I split it into 100 small documents, index each one separately, and then simply merge the contents under the output folder?

fucktx avatar Jul 25 '24 01:07 fucktx

python -m graphrag.index --root ./ragtest

I have a CSV file with over 2,000 news articles, and indexing has been running for more than ten hours. Is there another way to initialize the data? For example, could I split it into 100 small documents, index each one separately, and then simply merge the contents under the output folder?

Are you using the OpenAI API or Azure OpenAI, or some other model? It seems that indexing data with other models runs into similar problems.

zejiancai avatar Jul 25 '24 01:07 zejiancai

Did parsing the CSV report any errors?

boxingYi avatar Jul 25 '24 03:07 boxingYi

Did parsing the CSV report any errors?

No errors, it's just very slow. One thing I can confirm is that the news content is quite long; some articles run to over ten thousand Chinese characters.

fucktx avatar Jul 25 '24 03:07 fucktx

I have a CSV file with over 2,000 news articles, and indexing has been running for more than ten hours. Is there another way to initialize the data? For example, could I split it into 100 small documents, index each one separately, and then simply merge the contents under the output folder?

Are you using the OpenAI API or Azure OpenAI, or some other model? It seems that indexing data with other models runs into similar problems.

I'm using another third-party model. The API is stable and rarely reports errors. I kept the default settings file and only changed concurrent_requests: 10 and batch_size: 5.

fucktx avatar Jul 25 '24 03:07 fucktx
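[Editor's note] For reference, those two knobs live in the settings.yaml that graphrag init generates. A minimal sketch of the relevant sections (assuming the graphrag 0.x template layout; surrounding keys elided):

llm:
  # ... api_key, type, model ...
  concurrent_requests: 10    # parallel in-flight chat completions

parallelization:
  stagger: 0.3               # delay between launching parallel requests

embeddings:
  llm:
    # ... api_key, type, model ...
    concurrent_requests: 5
    batch_size: 5            # texts sent per embedding request

With 2,000 long articles the pipeline issues roughly several LLM calls per chunk (extraction plus gleaning and summarization passes), so a low concurrent_requests value caps throughput directly; raising it toward what the provider's rate limit allows is usually the first lever to pull.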

Did you change the chunk size to 1200 and the overlap to 100?

KylinMountain avatar Jul 25 '24 22:07 KylinMountain
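[Editor's note] For context, those two numbers refer to the chunks section of settings.yaml; a minimal sketch, with key names as in the default template and the values suggested here:

chunks:
  size: 1200               # tokens per chunk; larger chunks mean fewer LLM calls
  overlap: 100             # tokens shared between adjacent chunks
  group_by_columns: [id]   # concatenate text units per document before chunking

Since the indexer sends at least one extraction prompt per chunk, larger chunks cut the number of requests roughly in proportion, which is the speed-up this suggestion is after.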

How much concurrency does your LLM model service support?

Nuclear6 avatar Jul 26 '24 01:07 Nuclear6

Did you change the chunk size to 1200 and the overlap to 100?

No.

fucktx avatar Jul 26 '24 02:07 fucktx

How much concurrency does your LLM model service support?

chat: deepseek-chat (DeepSeek), concurrent_requests: 10
embeddings: embedding-2 (Zhipu), concurrent_requests: 5

fucktx avatar Jul 26 '24 02:07 fucktx
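[Editor's note] Both of those are OpenAI-compatible third-party services, so the corresponding settings.yaml would look roughly like this. A sketch only: the api_base URLs below are assumptions based on the providers' published OpenAI-compatible endpoints, not values from this thread:

llm:
  type: openai_chat
  model: deepseek-chat
  api_base: https://api.deepseek.com/v1             # assumed DeepSeek endpoint
  concurrent_requests: 10

embeddings:
  llm:
    type: openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4  # assumed Zhipu endpoint
    concurrent_requests: 5

If either provider enforces tighter rate limits than these concurrency settings, graphrag's retry/backoff will stretch wall-clock time; the indexing logs should make that visible.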

How are your chunks divided? Are you dividing chunks according to OpenAI's tiktoken?

Nuclear6 avatar Jul 26 '24 02:07 Nuclear6

How are your chunks divided? Are you dividing chunks according to OpenAI's tiktoken?

It's done through python -m graphrag.index --init --root ./ragtest. The settings.yaml it generates keeps the default parameters; I only modified some of the llm and embeddings parameters. At the same time, I'm currently loading multiple documents at once and calling the run_pipeline_with_config function in batches.

fucktx avatar Jul 26 '24 02:07 fucktx

For the official chunking logic you are using, you can follow @KylinMountain's suggestion and try increasing the chunk size. It is best to go through the logs to carefully analyze where the time is being spent.

Nuclear6 avatar Jul 26 '24 02:07 Nuclear6

This is the original discussion about chunk size, which should reduce the total number of requests and your token consumption: https://github.com/microsoft/graphrag/discussions/460

KylinMountain avatar Jul 26 '24 05:07 KylinMountain

This is the original discussion about chunk size, which should reduce the total number of requests and your token consumption: #460

ok, thanks

fucktx avatar Jul 26 '24 05:07 fucktx

For the official chunking logic you are using, you can follow @KylinMountain's suggestion and try increasing the chunk size. It is best to go through the logs to carefully analyze where the time is being spent.

ok, thanks

fucktx avatar Jul 26 '24 05:07 fucktx

Is it still just as slow now? I'm using Alibaba Cloud's deepseek-r1 model, and 19 articles haven't finished after 10 hours.

fenglex avatar Jun 10 '25 02:06 fenglex

Is it still just as slow now? I'm using Alibaba Cloud's deepseek-r1 model, and 19 articles haven't finished after 10 hours.

I ran 100 files for 10 days and progress was only 13%. I've already given up.

pomeloking01 avatar Sep 25 '25 08:09 pomeloking01