[Question]: Parse an Excel file by table chunk method

Open sanwei111 opened this issue 8 months ago • 10 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

I want to parse an Excel file with the table chunk method. The Excel file has several sheets.

The file parses smoothly, but how do I distinguish the chunks that come from different sheets?

sanwei111 avatar Mar 31 '25 07:03 sanwei111

To parse an Excel file using the table chunk method and distinguish between different sheet chunks, you can use the html method from the deepdoc/parser/excel_parser.py file in the ragflow codebase. This method processes each sheet in the workbook separately and creates HTML tables for each chunk of rows. Each table includes a caption with the sheet name, which helps distinguish between different sheet chunks.

Here's a simplified explanation of the code:

def html(self, fnm, chunk_rows=256):
    file_like_object = BytesIO(fnm) if not isinstance(fnm, str) else fnm
    wb = RAGFlowExcelParser._load_excel_to_workbook(file_like_object)
    tb_chunks = []
    for sheetname in wb.sheetnames:
        ws = wb[sheetname]
        rows = list(ws.rows)
        if not rows:
            continue

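        # Build the header row once from the first row of the sheet.
        tb_rows_0 = "<tr>"
        for t in list(rows[0]):
            tb_rows_0 += f"<th>{t.value}</th>"
        tb_rows_0 += "</tr>"

        # Emit one table per chunk_rows data rows; each table repeats the
        # header and carries the sheet name in its caption.
        for chunk_i in range((len(rows) - 1) // chunk_rows + 1):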
            tb = ""
            tb += f"<table><caption>{sheetname}</caption>"
            tb += tb_rows_0
            for r in list(
                rows[1 + chunk_i * chunk_rows: 1 + (chunk_i + 1) * chunk_rows]
            ):
                tb += "<tr>"
                for i, c in enumerate(r):
                    if c.value is None:
                        tb += "<td></td>"
                    else:
                        tb += f"<td>{c.value}</td>"
                tb += "</tr>"
            tb += "</table>\n"
            tb_chunks.append(tb)

    return tb_chunks

This method returns a list of HTML table strings, each representing a chunk of rows from a specific sheet, with the sheet name included in the table caption [1].
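
Since each chunk returned by the html method quoted above starts with a caption holding the sheet name, the chunks can be grouped by sheet by reading that caption back out. The following is a minimal sketch, not part of RAGFlow: `group_chunks_by_sheet`, the `"<unknown>"` fallback key, and the sample chunks are made up for illustration. Whether chunks stored by the Table chunk method in a knowledge base keep this shape is not confirmed in this thread.

```python
import re
from collections import defaultdict

# Matches the caption that html() writes immediately after the opening <table>.
CAPTION_RE = re.compile(r"<table><caption>(.*?)</caption>", re.DOTALL)


def group_chunks_by_sheet(tb_chunks):
    """Group HTML table chunks by the sheet name found in their caption.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for chunk in tb_chunks:
        match = CAPTION_RE.match(chunk)
        # Chunks without a leading caption go under a placeholder key.
        sheet = match.group(1) if match else "<unknown>"
        grouped[sheet].append(chunk)
    return dict(grouped)


# Hand-written sample chunks in the same shape html() produces.
sample_chunks = [
    "<table><caption>Sales</caption><tr><th>id</th></tr><tr><td>1</td></tr></table>\n",
    "<table><caption>Sales</caption><tr><th>id</th></tr><tr><td>2</td></tr></table>\n",
    "<table><caption>Staff</caption><tr><th>name</th></tr><tr><td>Ann</td></tr></table>\n",
]

by_sheet = group_chunks_by_sheet(sample_chunks)
for sheet, chunks in by_sheet.items():
    print(sheet, len(chunks))
```

A sheet name that itself contains HTML markup would need a real HTML parser rather than a regular expression.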

dosubot[bot] avatar Mar 31 '25 07:03 dosubot[bot]

The sheet name will be appended to the end of every chunk if you apply the General chunking method.

KevinHuSh avatar Mar 31 '25 07:03 KevinHuSh

Use the General chunk method.

OneKe avatar Mar 31 '25 07:03 OneKe

The sheet name will be appended to the end of every chunk if you apply the General chunking method.

Thanks for replying, sir.

Two more questions:

  1. What does the "table" chunk method do? In which scenarios should I use it?
  2. If I have several sheets, after parsing the file with the "table" method, how do I distinguish which sheet a chunk came from?

sanwei111 avatar Mar 31 '25 07:03 sanwei111

Use the General chunk method.

How about the "table" chunk method?

sanwei111 avatar Mar 31 '25 07:03 sanwei111

What does the "table" chunk method do? In which scenarios should I use it?

It's usually for SQL search over structured data.

KevinHuSh avatar Mar 31 '25 08:03 KevinHuSh

What does the "table" chunk method do? In which scenarios should I use it?

It's usually for SQL search over structured data.

Thanks, sir. One more question. This is the original Excel file:

Image

This is the result:

Image

How do I use the SQL search you mentioned?

sanwei111 avatar Mar 31 '25 08:03 sanwei111

If any knowledge base (KB) uses Table as its chunk method, RAGFlow turns users' questions into Elasticsearch (ES) SQL queries. So the Table method is usually used for data dumped from a database. That way, people do not need to write SQL themselves.
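
To make the idea concrete, here is a toy sketch of answering a natural-language question with SQL over tabular data. It is only an analogy: it uses sqlite3 from the Python standard library instead of Elasticsearch, the `orders` table and its columns are invented, and the SQL is written by hand, whereas per the comment above RAGFlow generates the query itself.

```python
import sqlite3

# In-memory stand-in for rows dumped from a database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 30.0)],
)

# User question: "How much did alice spend in total?"
# Hand-written SQL standing in for the query that would be generated.
sql = "SELECT SUM(amount) FROM orders WHERE customer = ?"
total = conn.execute(sql, ("alice",)).fetchone()[0]
print(total)
```

The point is that the user only asks the question; the translation to SQL happens behind the scenes.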

KevinHuSh avatar Apr 01 '25 01:04 KevinHuSh

If any knowledge base (KB) uses Table as its chunk method, RAGFlow turns users' questions into Elasticsearch (ES) SQL queries. So the Table method is usually used for data dumped from a database. That way, people do not need to write SQL themselves.

OK, sir. Where can I find a demo of "RAGFlow will turn users' questions into ES SQL to query"? After parsing the xlsx into JSON blocks with the table method, how do I query it?

I have now built an assistant like this:

Image

How do I "turn users' questions into ES SQL to query"?

sanwei111 avatar Apr 01 '25 02:04 sanwei111

@sanwei111 It happens whenever any KB uses Table as its chunk method.

KevinHuSh avatar Apr 01 '25 03:04 KevinHuSh