
bug - duplicates merged cell text following issue #2106

Open veredmm opened this issue 1 year ago • 2 comments

Still having this duplicated-text problem with this kind of table structure:

merged_table2.docx

table doc:

image

After partition_docx:

image

python-docx 1.1.2, unstructured 0.14.3

veredmm • Jun 19 '24 04:06

@veredmm I'm getting "HEADER 5 4 3 2 1 AAA BBB CCC" as elements[0].text for that document, which is the expected behavior and does not repeat the text in that merged cell.

The .metadata.text_as_html for that Table element is this uniform 3 row x 8 col table:

  <table>
    <thead>
      <tr>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>5</td>
        <td>4</td>
        <td>4</td>
        <td>3</td>
        <td>2</td>
        <td>2</td>
        <td>1</td>
        <td>1</td>
      </tr>
      <tr>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
      </tr>
    </tbody>
  </table>

The HTML table in .text_as_html is purposely made "uniform" (same number of cells in each row), which is why the same content appears in each "grid" cell of a merged cell.
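To make the "uniform" idea concrete, here is a minimal sketch of how merged cells could end up repeated across grid columns. It is not the library's actual implementation: `expand_row` is a hypothetical helper, and the column spans are guessed from the HTML above rather than read from the .docx file.

```python
def expand_row(cells):
    """Expand (text, colspan) pairs into one text entry per grid column."""
    grid_row = []
    for text, colspan in cells:
        # A cell spanning N columns contributes its text N times.
        grid_row.extend([text] * colspan)
    return grid_row


# Hypothetical spans, inferred from the HTML output (an assumption).
header = [("HEADER", 8)]
numbers = [("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)]
body = [("AAA\nBBB\nCCC", 8)]

grid = [expand_row(row) for row in (header, numbers, body)]
for row in grid:
    print(len(row), row)
```

Every row comes out with the same number of entries, which is why the text of a merged cell shows up once per grid column it covers.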

If you think that should look different, please suggest (in HTML) what you think it should look like instead and we'll consider a change.

scanny • Jun 20 '24 18:06

Thanks @scanny. I would suggest that the content of a merged cell appear only in the first cell (td) of the table row, with the other cells left empty.
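As a rough illustration of that suggestion, the sketch below blanks every repeat in a run of identical adjacent cells and renders the row as HTML. Both helpers are hypothetical and work on plain lists of cell text, not on the library's objects. One caveat: without span information this heuristic cannot tell a merged cell from two separate cells that happen to hold the same text.

```python
from html import escape


def blank_repeated_cells(row):
    """Keep the first cell of each run of equal adjacent texts; blank the rest."""
    out = []
    previous = None
    for text in row:
        out.append("" if text == previous else text)
        previous = text  # compare against the original text, not the blanked one
    return out


def row_to_html(row, tag="td"):
    """Render a list of cell texts as a single <tr> element."""
    cells = "".join(f"<{tag}>{escape(text)}</{tag}>" for text in row)
    return f"<tr>{cells}</tr>"


uniform_row = ["5", "4", "4", "3", "2", "2", "1", "1"]
print(row_to_html(blank_repeated_cells(uniform_row)))
```

For the fully merged rows this would leave the text in the first cell only, followed by seven empty cells.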

veredmm • Jun 24 '24 04:06

Hello, has this issue been resolved? @scanny @veredmm

wxh0613 • Oct 25 '24 03:10