pdf2docx

After extract_tables some values have <NEST TABLE>

Open Rustemhak opened this issue 2 years ago • 7 comments

Hello. After applying the extract_tables function, some of the returned lists contain the value <NEST TABLE>. Is it possible to extract the data from inside a <NEST TABLE>? If necessary, I can provide an example PDF file where this occurs.

Rustemhak avatar Feb 05 '23 13:02 Rustemhak

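Thanks for such a useful library. I've run into the same problem: the content of a table cell is sometimes parsed as <NEST TABLE>. I've found that this is very likely to happen when the table has a gray background.

Table in the PDF: image

debug: image

Do you have any suggestions for how this could be improved?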

macanv avatar Apr 20 '23 01:04 macanv

Same question here.

fefefefefefe avatar May 01 '23 21:05 fefefefefefe

Yes, I also noticed that one of the reasons <NEST TABLE> appears is the presence of other colors in the table. Some of my tables had yellow highlighting because of a comment left in the PDF document, so where possible I removed all such comments manually. Nevertheless, there are still documents where <NEST TABLE> appears for some other reason; in those cases I just log that not everything was extracted from the table.

Rustemhak avatar May 08 '23 19:05 Rustemhak
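
The logging workaround described above can be sketched roughly as follows. This is only an illustration: it assumes the result of extract_tables is a list of tables, each a list of rows, each a list of cell strings (or None). The function name `find_nested_placeholders` and the sample data are made up for the example and are not part of pdf2docx.

```python
import logging

NEST_TABLE = "<NEST TABLE>"


def find_nested_placeholders(tables):
    """Return (table, row, column) indices of cells containing the placeholder.

    Hypothetical helper: `tables` is assumed to be a list of tables, where
    each table is a list of rows and each row is a list of str or None.
    """
    hits = []
    for t, table in enumerate(tables):
        for r, row in enumerate(table):
            for c, cell in enumerate(row):
                # Cells that are None (or not strings) cannot hold the marker.
                if isinstance(cell, str) and NEST_TABLE in cell:
                    hits.append((t, r, c))
    return hits


# Made-up sample standing in for real extraction output.
tables = [
    [["Name", "Value"], ["alpha", "<NEST TABLE>"]],
    [["x", None], ["<NEST TABLE>", "y"]],
]

for t, r, c in find_nested_placeholders(tables):
    logging.warning("table %d, row %d, column %d was not fully extracted", t, r, c)
```

This only reports where data is missing; it does not recover the nested content.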

Is there any update on this issue? I also encounter this when parsing a table with a background color. See the test PDF file and the result screenshot below: image

In every even row (the ones with a blue background), the header is parsed as <NEST TABLE>. test.pdf

icarusxxy avatar Jul 15 '24 06:07 icarusxxy

@icarusxxy This modification in table/Cell.py worked for me. It's not perfect, but I'd rather have some text than <NEST TABLE>.

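    # Requires `import logging` at the top of table/Cell.py.
    # `flatten` is a helper that is not shown in this post.
    @property
    def text(self):
        '''Text contained in this cell.'''
        if not self: return None
        # NOTE: a cell may contain a sub-table

        text = []

        for block in self.blocks:
            if block.is_text_block:
                text.append(block.text)
            elif block.__class__.__name__ == "TableBlock":
                # Nested table: join the text of all its cells into one string
                text.append(''.join(flatten(block.text)))
            else:
                # Any other block type: log it and keep the placeholder
                logging.warning(f"Found {block.__class__.__name__} - {block.text}")
                text.append("<NEST TABLE>")

        return '\n'.join(text)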

That said, I eventually moved to pdfplumber, as it works better for the documents I'm working with.

vivekagr avatar Jul 25 '24 23:07 vivekagr
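
The snippet above calls a `flatten` helper that is not included in the post. A minimal sketch of what such a helper might look like is given below. It assumes the text of a nested table is a list of rows, each a list of cell strings, possibly nested further and possibly containing None. This is a guess at the intent, not the author's actual implementation.

```python
from collections.abc import Iterable, Iterator


def flatten(items: Iterable) -> Iterator[str]:
    """Yield every string found in an arbitrarily nested list, depth first.

    Hypothetical helper: None entries (assumed here to mark empty cells)
    are skipped so the result can be passed straight to str.join.
    """
    for item in items:
        if item is None:
            continue
        if isinstance(item, str):
            # Strings are iterable too, so they must be handled before
            # the generic Iterable branch to avoid infinite recursion.
            yield item
        elif isinstance(item, Iterable):
            yield from flatten(item)
        else:
            yield str(item)


# Made-up nested table text for illustration.
nested = [["a", None], ["b", ["c", "d"]]]
print("".join(flatten(nested)))
```

Joining with an empty string, as the snippet above does, runs adjacent cells together; joining with a space or a tab may be easier to read, depending on the document.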