bug - duplicates merged cell text following issue #2106
I'm still seeing the duplicated-text problem with this kind of table structure:
table doc:
after partition_docx:
Versions: python-docx 1.1.2, unstructured 0.14.3
@veredmm I'm getting "HEADER 5 4 3 2 1 AAA BBB CCC" as elements[0].text for that document, which is the expected behavior and does not repeat the text in that merged cell.
The .metadata.text_as_html for that Table element is this uniform 3 row x 8 col table:
<table>
<thead>
<tr>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
</tr>
</tbody>
</table>
The HTML table in .text_as_html is purposely made "uniform" (same number of cells in each row), which is why the same content appears in each "grid" cell of a merged cell.
If you think it should look different, please suggest (in HTML) what you think it should look like instead and we'll consider a change.
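To illustrate what "uniform" means here, the following is a minimal sketch, not the library's actual implementation. It assumes each row is described as (text, span) pairs, where the spans are inferred from the HTML output above rather than read from the original document, and `expand_row` is a hypothetical helper name. It only models horizontal merges.

```python
from typing import List, Tuple


def expand_row(cells: List[Tuple[str, int]]) -> List[str]:
    """Repeat each cell's text once per grid column the cell spans.

    Hypothetical helper for illustration only; it is not part of
    unstructured or python-docx.
    """
    grid: List[str] = []
    for text, span in cells:
        grid.extend([text] * span)
    return grid


# Column spans below are an assumption inferred from the HTML shown above.
rows = [
    [("HEADER", 8)],
    [("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)],
    [("AAA\nBBB\nCCC", 8)],
]

# Every row ends up with the same number of grid cells (8 here),
# which is why merged-cell text shows up more than once per row.
uniform = [expand_row(row) for row in rows]
```

Under these assumed spans, each merged cell contributes one entry per grid column it covers, so all three rows come out the same length.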
Thanks @scanny. I would suggest that the content of a merged cell appear only in the first cell (td) of the table row, with the other cells left empty.
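Until something like this is decided, the suggestion above could be approximated by post-processing the .text_as_html string on the caller's side. This is only a rough sketch using the standard library; `blank_repeated_cells` is a hypothetical name, not an unstructured API. It relies on a heuristic assumption: any cell whose text equals that of the cell immediately to its left is treated as part of the same merged cell. That will also blank two genuinely separate neighbouring cells that happen to hold the same text, and it does not handle vertical merges.

```python
import xml.etree.ElementTree as ET


def blank_repeated_cells(table_html: str) -> str:
    """Blank cells that repeat the text of the cell to their left.

    Heuristic only: assumes the input is a well-formed <table> fragment
    and that adjacent identical text means a horizontally merged cell.
    """
    root = ET.fromstring(table_html)
    for row in root.iter("tr"):
        previous = None
        for cell in row:
            if cell.tag not in ("td", "th"):
                continue
            text = cell.text or ""
            if previous is not None and text != "" and text == previous:
                # Repeat of the preceding cell: keep the cell, drop its text.
                cell.text = ""
            else:
                previous = text
    # method="html" writes empty cells as <td></td> rather than <td />.
    return ET.tostring(root, encoding="unicode", method="html")


example = (
    "<table><tbody><tr>"
    "<td>5</td><td>4</td><td>4</td><td>3</td><td>2</td><td>2</td>"
    "</tr></tbody></table>"
)
cleaned = blank_repeated_cells(example)
```

The row keeps the same number of cells, so the table stays uniform; only the repeated text is removed.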
Hello, has this issue been resolved, @scanny @veredmm?