More info for words within detected tables -- Feature Request

Open ghost opened this issue 4 years ago • 6 comments

Table.extract() returns a matrix of strings corresponding to the table cells. It would be useful to get the individual word bounding boxes (through an option) like what Page.extract_words() returns. Is this already available?

Table.extract() returns the following:
    [
        ['r1 c1', 'r1 c2', 'r1 c3'],
        ['r2 c1', 'r2 c2', 'r2 c3'],
    ]

Table.extract(word_boxes=True) could return those 12 words in the exact same format as Page.extract_words() does.

It's possible to get the page words and map them to cells. But the page words are sometimes merged ignoring column boundaries as reported here.
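As a rough illustration of that workaround, here is a minimal sketch that assigns each word to the cell containing its midpoint. The helper name `words_by_cell` is hypothetical and not part of pdfplumber. It assumes word dicts with `x0`, `x1`, `top`, `bottom` keys (the shape `Page.extract_words()` returns) and cell boxes given as `(x0, top, x1, bottom)` tuples.

```python
def words_by_cell(words, rows):
    """Group word dicts into a matrix that mirrors the table layout.

    words: dicts with "x0", "x1", "top", "bottom" keys
    rows:  list of rows, each a list of cell boxes (x0, top, x1, bottom);
           a cell may be None where the table has no cell
    """
    grid = [[[] for _ in row] for row in rows]
    for word in words:
        # Use the word's midpoint so small overlaps with cell borders are tolerated.
        cx = (word["x0"] + word["x1"]) / 2
        cy = (word["top"] + word["bottom"]) / 2
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                if cell is None:
                    continue
                x0, top, x1, bottom = cell
                # Half-open intervals: a midpoint on a shared border lands in one cell only.
                if x0 <= cx < x1 and top <= cy < bottom:
                    grid[r][c].append(word)
    return grid
```

With pdfplumber, the inputs could presumably come from `table = page.find_tables()[0]`, `rows = [row.cells for row in table.rows]`, and `words = page.extract_words()`, though the exact attributes may differ between versions. This does not fix the merging problem: a word merged across a column boundary still ends up whole in whichever cell contains its midpoint.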

ghost avatar Feb 10 '20 01:02 ghost

I agree! That could be a very nice feature. I will consider adding it in the future. In the meantime, PRs are welcome on this.

jsvine avatar Apr 08 '20 01:04 jsvine

@greddyatpt Has this been solved?

andlike avatar Jul 13 '20 02:07 andlike

Sorry, I deal with Chinese PDFs but don't know Chinese. Google says 解决了没有 means "Solved". Are you saying the feature is implemented?

ghost avatar Jul 13 '20 04:07 ghost

Sorry, I deal with Chinese PDFs but don't know Chinese. Google says 解决了没有 means "Solved". Are you saying the feature is implemented?

Have you implemented this feature? It would make the text extracted from the table carry the following information: text, fontname, size, adv, upright, height, width, x0, x1, y0, y1, top, bottom, doctop, object_type.

zhushiyuan-star avatar Mar 31 '21 09:03 zhushiyuan-star

Sorry, I deal with Chinese PDFs but don't know Chinese. Google says 解决了没有 means "Solved". Are you saying the feature is implemented?

"解决了没有" is a question ("解决了没有?"). It means "Has this feature been implemented?"

wind-chh avatar Apr 01 '21 00:04 wind-chh

Sorry, I deal with Chinese PDFs but don't know Chinese. Google says 解决了没有 means "Solved". Are you saying the feature is implemented?

Have you implemented this feature? It would make the text extracted from the table carry the following information: text, fontname, size, adv, upright, height, width, x0, x1, y0, y1, top, bottom, doctop, object_type.

These attributes (text, fontname, size, ...) belong to the chars of the page. Page.extract_table() returns only the text, but with a slight modification you can make it return the chars, which carry all of this information.
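One way such a modification might look, as a sketch only and not pdfplumber's own implementation: collect the chars whose midpoint falls inside each cell and return them alongside the joined text. The function name `cell_records` is hypothetical. It assumes char dicts shaped like the entries of `page.chars`, cell boxes as `(x0, top, x1, bottom)` tuples, and chars that already arrive in reading order.

```python
def cell_records(chars, rows):
    """Return, per cell, the joined text plus the char dicts it was built from.

    chars: dicts with "text", "x0", "x1", "top", "bottom" (other keys are kept as-is)
    rows:  list of rows, each a list of cell boxes (x0, top, x1, bottom) or None
    """
    def inside(char, cell):
        x0, top, x1, bottom = cell
        cx = (char["x0"] + char["x1"]) / 2
        cy = (char["top"] + char["bottom"]) / 2
        return x0 <= cx < x1 and top <= cy < bottom

    table = []
    for row in rows:
        out_row = []
        for cell in row:
            if cell is None:
                out_row.append(None)
                continue
            # Assumption: chars are already in reading order, so no sorting is done here.
            cell_chars = [ch for ch in chars if inside(ch, cell)]
            out_row.append({
                "text": "".join(ch["text"] for ch in cell_chars),
                "chars": cell_chars,  # each char keeps fontname, size, etc.
            })
        table.append(out_row)
    return table
```

Because the chars are filtered per cell, text is never merged across a column boundary. Spaces between words are not reconstructed here, since PDFs often carry no explicit space chars.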

wind-chh avatar Apr 01 '21 00:04 wind-chh