
[Question]: Excel to Html Chunks exceeding 12 lines

Open ahmadshakil opened this issue 4 weeks ago • 4 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

When the Excel to Html option is enabled during parsing, why do chunks have more than 12 lines? According to the feature description, a chunk should not exceed 12 lines. Attached are the Excel file and the resulting chunk.

Financials Sample Data Report.xlsx

Chunk content -> abc.html

ahmadshakil avatar Nov 26 '25 05:11 ahmadshakil

Hi @ahmadshakil! I'm Dosu and I’m helping the ragflow team.

When you enable "Excel to Html" parsing, chunking is intentionally handled differently than for plain text. The chunking logic in RAGFlow's Excel parser groups rows into much larger chunks (default: 256 rows per chunk), and recent changes set the chunk size enforcement to zero for this mode, so the usual 12-line limit is not applied. This is by design to preserve the integrity of tables and avoid splitting them in ways that could break their structure or meaning. There is currently no user-configurable way to enforce a 12-line chunk limit for Excel to HTML parsing. If you need finer chunking for tabular data, a possible workaround is to convert your data to JSONL format, which allows one row per chunk during ingestion, but this does not apply directly to Excel to HTML parsing. For more details, see the relevant code and recent changes: PR #11519, Excel parser implementation, and discussion.
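For illustration only, the JSONL workaround mentioned above could be sketched roughly as follows. This is not RAGFlow code: `rows_to_jsonl` is a hypothetical helper, and it assumes the sheet has already been read into a header list and a list of row lists (the Excel reading step is left out).

```python
import json


def rows_to_jsonl(header, rows):
    """Serialize each table row as one JSON object per line (JSONL).

    `header` is a list of column names and `rows` is a list of row lists.
    Both are assumed to have been extracted from the spreadsheet already.
    """
    return "\n".join(
        json.dumps(dict(zip(header, row)), ensure_ascii=False) for row in rows
    )
```

Each output line then carries its own column names, so a row stays self-describing when it is ingested as a separate chunk.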

dosubot[bot] avatar Nov 26 '25 05:11 dosubot[bot]

What do you mean by “exceeding 12 lines”? Also, where exactly did you use Excel-to-HTML conversion? Are you referring to the message component that downloads output in HTML format?

Magicbook1108 avatar Nov 27 '25 01:11 Magicbook1108

I am referring to the Excel to Html option in the dataset configuration for Excel files.

ahmadshakil avatar Nov 27 '25 07:11 ahmadshakil

@ahmadshakil Hi, thanks for sharing the HTML file. Based on our reproduction, the current chunking rule is 12 lines per chunk; if the header row is included, it becomes 13 lines in total (header + 12 rows). Could you clarify what exactly is confusing or unexpected on your side?
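As a rough sketch of the rule described above (header + 12 rows per chunk), and not RAGFlow's actual implementation: `rows_to_html_chunks` and `ROWS_PER_CHUNK` are hypothetical names, and the table is assumed to be already loaded as a header list plus a list of row lists.

```python
from html import escape

ROWS_PER_CHUNK = 12  # data rows per chunk, as stated in this thread


def rows_to_html_chunks(header, rows, rows_per_chunk=ROWS_PER_CHUNK):
    """Split table rows into HTML tables, repeating the header in each chunk."""
    header_line = (
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>"
    )
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        lines = [header_line]
        for row in rows[start:start + rows_per_chunk]:
            cells = "".join(f"<td>{escape(str(c))}</td>" for c in row)
            lines.append("<tr>" + cells + "</tr>")
        chunks.append("<table>\n" + "\n".join(lines) + "\n</table>")
    return chunks
```

With 30 data rows this yields three chunks. The first two contain 13 `<tr>` lines each (1 header + 12 rows), and the last contains 7 (1 header + 6 rows).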

buua436 avatar Nov 27 '25 09:11 buua436

When I parse the file Financials Sample Data Report.xlsx, why does the chunk contain more than 13 lines? (abc.html is the chunk itself.)

ahmadshakil avatar Dec 01 '25 05:12 ahmadshakil

@ahmadshakil I see that abc.html has 13 lines.

For reference, I’ve included my parsing configuration and the output below. Could you please point out how it differs from your setup/results?

[Three screenshots: parsing configuration and output]

buua436 avatar Dec 01 '25 05:12 buua436

You are right, the chunk has 13 lines. I was confusing a RAPTOR cluster with individual chunks. What I understand now is that with RAPTOR, multiple chunks are sent, and abc.html contains one cluster.

ahmadshakil avatar Dec 01 '25 06:12 ahmadshakil