markitdown
Merged cell conversion issue in both Excel and PPTX
Merged cells in Excel are split when converted to Markdown, producing a table with new empty cells, which may lead to incorrect or lost information.
Before conversion
After conversion
Markdown tables do not allow merged cells. This is a limitation of the Markdown language itself, not of markitdown. HTML does allow it. It would be more flexible to have an xml-it tool alongside the Markdown one. In any case, I think Markdown also supports embedded HTML, so perhaps the developers can eventually use XML for tables.
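To illustrate the point about HTML, here is a minimal sketch of how a merged cell could be kept as `rowspan`/`colspan` in an HTML table that is then embedded in Markdown. This is not markitdown code: the function name `to_html_table` and the `(top, left, bottom, right)` merge format are assumptions made up for this example.

```python
from html import escape


def to_html_table(rows, merges):
    """Render a grid as an HTML table, keeping merged cells as spans.

    rows:   list of equal-length lists of cell values.
    merges: list of (top, left, bottom, right) ranges, 0-based and inclusive
            (a hypothetical format chosen for this sketch).
    """
    span_at = {}      # anchor cell -> (rowspan, colspan)
    covered = set()   # cells hidden underneath a merged range
    for top, left, bottom, right in merges:
        span_at[(top, left)] = (bottom - top + 1, right - left + 1)
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                if (r, c) != (top, left):
                    covered.add((r, c))

    lines = ["<table>"]
    for r, row in enumerate(rows):
        cells = []
        for c, value in enumerate(row):
            if (r, c) in covered:
                continue  # this position is part of a span, emit nothing
            rowspan, colspan = span_at.get((r, c), (1, 1))
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            cells.append(f"<td{attrs}>{escape(str(value))}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [["Region", "Q1", "Q2"],
        ["North", "1", "2"],
        ["", "3", "4"]]
html_table = to_html_table(rows, merges=[(1, 0, 2, 0)])
print(html_table)
```

Whether the result is displayed as a table depends on the Markdown renderer allowing raw HTML, so this is a trade-off rather than a drop-in fix.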
I also ran into similar problems when parsing tables. I work with very large, complex tables that contain many merged cells with large spans.
Without an option to parse merged cells, my tables are basically unreadable.
In PR #1165 I added an option that supports parsing merged cells and headers in Excel and fills the merged value into the child cells.
Combined with the expansion of parent and child items in the table, which I implemented myself, this can greatly improve an LLM's understanding of the table.
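For readers who want to see the idea of filling values into child cells, here is a rough sketch in plain Python. It is not the code from PR #1165: the names `fill_children` and `to_markdown`, and the `(top, left, bottom, right)` merge format, are hypothetical and only show the general technique.

```python
def fill_children(rows, merges):
    """Copy each merged range's top-left value into every cell it covers.

    rows:   list of equal-length lists of strings.
    merges: list of (top, left, bottom, right) ranges, 0-based and inclusive
            (a hypothetical format chosen for this sketch).
    """
    filled = [list(row) for row in rows]  # leave the input untouched
    for top, left, bottom, right in merges:
        value = filled[top][left]
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                filled[r][c] = value
    return filled


def to_markdown(rows):
    """Render rows as a Markdown table, using the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [["Region", "Q1", "Q2"],
        ["North", "1", "2"],
        ["", "3", "4"]]   # "North" is merged across the two data rows
filled = fill_children(rows, merges=[(1, 0, 2, 0)])
markdown = to_markdown(filled)
print(markdown)
```

After filling, every row carries its own "North" label, so the table can be read row by row, at the cost of no longer showing that the cell was merged.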
@BetterAndBetterII, hey bro, can you show me how to use the fill_merged_cells argument in Python code?
Same issue here. Were you guys able to figure it out?
@wei12314 Have you tried it? If so, do you need to customize the code in some of the files that BetterAndBetterII changed in those 4 commits?