pdfplumber
More info for words within detected tables -- Feature Request
Table.extract() returns a matrix of strings corresponding to the table cells. It would be useful to also get the individual word bounding boxes (through an option), like what Page.extract_words() returns. Is this already available?
Table.extract() returns the following:
[
['r1 c1', 'r1 c2', 'r1 c3'],
['r2 c1', 'r2 c2', 'r2 c3'],
]
Table.extract(word_boxes=True) could return those 12 words in exactly the same format as Page.extract_words() does.
It's possible to get the page words and map them to cells, but the page words are sometimes merged across column boundaries, as reported here.
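For reference, the word-to-cell mapping workaround can be sketched roughly as follows. This is a minimal sketch, not pdfplumber's own code: `center_in_bbox` and `words_by_cell` are hypothetical helper names, and the sample words and cell boxes are hand-made stand-ins for what Page.extract_words() and a detected table would provide. It assumes each cell is an `(x0, top, x1, bottom)` tuple and assigns a word to the cell that contains its center point.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def center_in_bbox(obj: Dict, bbox: BBox) -> bool:
    """Return True if the object's center point lies inside bbox."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def words_by_cell(
    words: Sequence[Dict], rows: Sequence[Sequence[Optional[BBox]]]
) -> List[List[List[Dict]]]:
    """Build a matrix shaped like the table; each entry lists the word dicts in that cell.

    Missing cells (None) yield an empty list.
    """
    return [
        [
            [w for w in words if center_in_bbox(w, cell)] if cell is not None else []
            for cell in row
        ]
        for row in rows
    ]


# Hand-made sample data (assumption): two cells in one row, two words per cell.
words = [
    {"text": "r1", "x0": 10, "x1": 20, "top": 5, "bottom": 15},
    {"text": "c1", "x0": 25, "x1": 35, "top": 5, "bottom": 15},
    {"text": "r1", "x0": 110, "x1": 120, "top": 5, "bottom": 15},
    {"text": "c2", "x0": 125, "x1": 135, "top": 5, "bottom": 15},
]
rows = [[(0, 0, 100, 20), (100, 0, 200, 20)]]

matrix = words_by_cell(words, rows)
cell_texts = [[[w["text"] for w in cell] for cell in row] for row in matrix]
```

With a real page, `rows` could probably be built from a table found by `page.find_tables()`, for example `[row.cells for row in table.rows]`, though the exact attributes may differ between pdfplumber versions. Note that this does not fix the merging problem: a word that spans a column boundary still lands whole in whichever cell contains its center.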
I agree! That could be a very nice feature. I will consider adding it in the future. In the meantime, PRs are welcome on this.
@greddyatpt Has this been resolved?
Sorry, I deal with Chinese PDFs but don't know Chinese. Google says 解决了没有 means "Solved". Are you saying the feature is implemented?
Have you implemented this feature? The text extracted from the table should include the following information: text, fontname, size, adv, upright, height, width, x0, x1, y0, y1, top, bottom, doctop, object_type.
“解决了没有” = "解决了没有?" = “Has this feature been implemented?”
This information (text, fontname, size, ...) is associated with the chars of the page. Page.extract_table() returns only the text, but you can make a slight modification so that it returns the chars themselves, which contain all of this information.
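A minimal sketch of that idea, done outside the library rather than by patching it: group the page's chars by cell, so each cell keeps the full char dicts (fontname, size, and so on) instead of only a string. The names `char_in_cell` and `chars_by_cell` are hypothetical, the sample chars are hand-made stand-ins for `page.chars`, and the center-point test is an assumption about how chars should be assigned to cells.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def char_in_cell(char: Dict, cell: BBox) -> bool:
    """Return True if the char's center point falls inside the cell box."""
    x0, top, x1, bottom = cell
    h_mid = (char["x0"] + char["x1"]) / 2
    v_mid = (char["top"] + char["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def chars_by_cell(
    chars: Sequence[Dict], rows: Sequence[Sequence[Optional[BBox]]]
) -> List[List[List[Dict]]]:
    """Return a table-shaped matrix whose entries are lists of char dicts.

    Chars keep their input order; missing cells (None) yield an empty list.
    """
    return [
        [
            [c for c in chars if char_in_cell(c, cell)] if cell is not None else []
            for cell in row
        ]
        for row in rows
    ]


def make_char(text: str, x0: float) -> Dict:
    """Build a sample char dict (assumption: a small subset of the real keys)."""
    return {
        "text": text, "x0": x0, "x1": x0 + 4, "top": 5, "bottom": 15,
        "fontname": "Helvetica", "size": 10,
    }


# Hand-made sample data: "r1" in the first cell, "c2" in the second.
chars = [make_char("r", 10), make_char("1", 14), make_char("c", 110), make_char("2", 114)]
rows = [[(0, 0, 100, 20), (100, 0, 200, 20)]]

cells = chars_by_cell(chars, rows)
texts = [["".join(c["text"] for c in cell) for cell in row] for row in cells]
fonts = [[sorted({c["fontname"] for c in cell}) for cell in row] for row in cells]
```

Because chars are assigned one by one, text never gets merged across a column boundary here. Grouping each cell's chars back into words with bounding boxes would still be a separate step.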