[Question]: Excel to Html Chunks exceeding 12 lines
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
Why do chunks have more than 12 lines when the Excel to Html option is enabled during parsing? According to the feature description, a chunk should not exceed 12 lines. Attached are the Excel file and the generated chunk.
Financials Sample Data Report.xlsx
Chunk content -> abc.html
Hi @ahmadshakil! I'm Dosu and I’m helping the ragflow team.
When you enable "Excel to Html" parsing, chunking is intentionally handled differently than for plain text. The chunking logic in RAGFlow's Excel parser groups rows into much larger chunks (default: 256 rows per chunk), and recent changes set the chunk size enforcement to zero for this mode, so the usual 12-line limit is not applied. This is by design to preserve the integrity of tables and avoid splitting them in ways that could break their structure or meaning. There is currently no user-configurable way to enforce a 12-line chunk limit for Excel to HTML parsing. If you need finer chunking for tabular data, a possible workaround is to convert your data to JSONL format, which allows one row per chunk during ingestion, but this does not apply directly to Excel to HTML parsing. For more details, see the relevant code and recent changes: PR #11519, Excel parser implementation, and discussion.
What do you mean by “exceeding 12 lines”? Also, where exactly did you use Excel-to-HTML conversion? Are you referring to the message component that downloads output in HTML format?
I am referring to the option in the Dataset configuration for Excel files.
@ahmadshakil Hi, thanks for sharing the HTML file. Based on our reproduction, the current chunking rule is 12 lines per chunk; if the header row is included, it becomes 13 lines in total (header + 12 rows). Could you clarify what exactly is confusing or unexpected on your side?
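To make the "header + 12 rows" rule concrete, here is a minimal Python sketch of that kind of chunking. It is not RAGFlow's actual implementation: the function names `rows_to_html_chunks` and `count_table_rows` and the `rows_per_chunk` parameter are hypothetical, and the sketch assumes the first row of the sheet is the header.

```python
from html import escape


def rows_to_html_chunks(rows, rows_per_chunk=12):
    """Split sheet rows into HTML tables of at most `rows_per_chunk` data rows.

    Assumption: rows[0] is the header, and it is repeated in every chunk,
    so a full chunk holds rows_per_chunk + 1 table rows.
    """
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    header_html = "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>"
    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        lines = [header_html]
        for row in body[start:start + rows_per_chunk]:
            cells = "".join(f"<td>{escape(str(c))}</td>" for c in row)
            lines.append("<tr>" + cells + "</tr>")
        chunks.append("<table>\n" + "\n".join(lines) + "\n</table>")
    return chunks


def count_table_rows(chunk):
    """Count the table rows (header included) in one HTML chunk."""
    return chunk.count("<tr>")


if __name__ == "__main__":
    # Hypothetical sheet: one header row plus 30 data rows.
    sheet = [["Segment", "Units"]] + [[f"row{i}", i] for i in range(30)]
    for chunk in rows_to_html_chunks(sheet):
        print(count_table_rows(chunk))
```

Under these assumptions, a sheet with 30 data rows produces three chunks, and counting `<tr>` elements in a single chunk is a quick way to check whether it really exceeds 13 rows.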
When I parse the file Financials Sample Data Report.xlsx, why are more than 13 lines created in the chunk? (abc.html is the chunk itself.)
@ahmadshakil I see that abc.html has 13 lines.
For reference, I’ve included my parsing configuration and the output below. Could you please point out how it differs from your setup/results?
You are right, the chunk has 13 lines. I was confusing a RAPTOR cluster with individual chunks. What I understand now is that with RAPTOR, multiple chunks are sent together, and abc.html contains one cluster.