[Question]: How to solve the problem that Excel document statistics function can only count top N data

Open leafcold opened this issue 8 months ago • 3 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

[Image] During data analysis, only the rows with IDs 0-2 appear to have been analyzed and processed, even though each table contains different cell names.

[Image] In reality, there are many other unit names.

[Image] How can I run a comprehensive query and analysis over all the data?

leafcold, Apr 01 '25 03:04

Hello, @leafcold! I'm here to help you with bug fixes, answer questions, and assist you in becoming a contributor. Let's work together to resolve any issues you're encountering.

I wasn't able to find specific information on how to solve the problem of the Excel document statistics function only counting the top N data within the RAGFlow repository. However, you can try the following general approach to address this issue:

  1. Review the ExcelParser Configuration: Ensure that the ExcelParser class is configured to process all relevant data, not just the top N entries. Check if there are any parameters or settings that limit the number of rows or sections being processed.

  2. Adjust Chunking and Parsing Settings: If the issue is related to how data is chunked or parsed, consider adjusting the chunking method or parser configuration to include all necessary data. This might involve modifying the parameters used in the html method of the ExcelParser class.

  3. Verify Data Retrieval Settings: Ensure that the retrieval settings are configured to include all relevant data. This might involve adjusting the similarity threshold, vector similarity weight, or other retrieval parameters to ensure comprehensive data analysis.

  4. Check for Software Updates or Patches: Look for any updates or patches in the RAGFlow repository that might address this issue. Sometimes, limitations in functionality are resolved in newer versions of the software.

If these steps do not resolve the issue, you may need to explore the codebase further or consult with other developers familiar with RAGFlow for more specific guidance.


dosubot[bot], Apr 01 '25 03:04

It looks like each row is stored as one chunk, doesn't it? What about increasing the number of chunks retrieved?

KevinHuSh, Apr 03 '25 01:04
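
As a rough illustration of the point above: if each row is its own chunk and retrieval returns only the top N chunks, any statistic computed afterwards sees at most N rows. The sketch below uses made-up rows and a hypothetical `retrieve_top_n` stand-in; it is not RAGFlow's retrieval code, and the `id` and `unit` field names are assumptions.

```python
from collections import Counter

# Hypothetical data: one dict per spreadsheet row. Under the assumption
# discussed above, each row would be stored as its own chunk.
rows = [
    {"id": 0, "unit": "Unit A"},
    {"id": 1, "unit": "Unit B"},
    {"id": 2, "unit": "Unit A"},
    {"id": 3, "unit": "Unit C"},
    {"id": 4, "unit": "Unit B"},
    {"id": 5, "unit": "Unit D"},
]


def retrieve_top_n(chunks, top_n):
    """Stand-in for retrieval: return only the first top_n chunks.

    A real retriever ranks chunks by similarity; the only point here is
    that at most top_n rows are passed on.
    """
    return chunks[:top_n]


def count_units(chunks):
    """Count how often each unit name appears in the given chunks."""
    return Counter(chunk["unit"] for chunk in chunks)


# Statistics over the retrieved chunks cover only three rows ...
partial = count_units(retrieve_top_n(rows, top_n=3))
# ... while statistics over the whole table cover every unit name.
full = count_units(rows)

print("top-3 only:", dict(partial))
print("all rows:  ", dict(full))
```

Raising `top_n` narrows the gap between the two results, at the price of sending more chunks to the model.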

> It looks like each row is stored as one chunk, doesn't it? What about increasing the number of chunks retrieved?

But doing this would put unnecessary pressure on the LLM, because it is equivalent to passing all the chunks to the model, and communication time is a cost. If RAGFlow could preprocess the existing data before passing it on, that would meet the current need.

leafcold, Apr 03 '25 06:04
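
The preprocessing idea in the last comment could look roughly like the sketch below: aggregate every row first, then hand the model only a compact summary instead of all the row chunks. This is a minimal illustration using the standard library only; `summarize_rows`, `summary_to_text`, and the `unit` and `amount` columns are hypothetical names, not part of RAGFlow.

```python
from collections import defaultdict


def summarize_rows(rows, group_key, value_key):
    """Aggregate all rows into a per-group row count and value total."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        group = summary[row[group_key]]
        group["count"] += 1
        group["total"] += float(row[value_key])
    return dict(summary)


def summary_to_text(summary):
    """Render the aggregate as a few short lines suitable for a prompt."""
    return "\n".join(
        f"{name}: rows={stats['count']}, total={stats['total']:g}"
        for name, stats in sorted(summary.items())
    )


# Hypothetical spreadsheet rows (column names are assumptions).
rows = [
    {"unit": "Unit A", "amount": 10},
    {"unit": "Unit B", "amount": 5},
    {"unit": "Unit A", "amount": 2.5},
    {"unit": "Unit C", "amount": 7},
]

summary = summarize_rows(rows, group_key="unit", value_key="amount")
context = summary_to_text(summary)
print(context)
```

The summary grows with the number of distinct groups rather than the number of rows, so the text passed to the model stays small even for large sheets.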