pdfplumber Handling merged cells (possible solution)

Open REDxEYE opened this issue 5 years ago • 8 comments

I've come up with somewhat working code that can handle merged cells by building a cell-to-cell map. I can't describe well what I did and how, because my English is limited, but you can look at my code here. The main magic happens in TableExtractor.build_skeleton and Table.build_table.

This is the result of your library: page-15-0_im

This is the skeleton map generated by my code: page-15-0-skeleton

And here is the result of my code analyzing the lines and points after your library: page-15-0

Table.global_map contains the mapped cells; for example, 0-0 and 0-1 point to the same cell with the text Peripheral.
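For readers who want to picture the cell-to-cell map, here is a minimal sketch of the idea and not the actual code from the linked repository. The names Cell and build_global_map, the (row, col) tuple keys, and the inclusive span format are all assumptions made for illustration; the post itself only says that positions 0-0 and 0-1 point to the same cell.

```python
class Cell:
    """Hypothetical cell object; a real one would also carry a bbox."""

    def __init__(self, text):
        self.text = text


def build_global_map(spans):
    """Map every grid position to the cell object that covers it.

    `spans` is an assumed input format: a list of
    (text, first_row, first_col, last_row, last_col) tuples with
    inclusive bounds. A merged cell covers more than one position,
    so several keys end up pointing at the same Cell instance.
    """
    global_map = {}
    for text, r0, c0, r1, c1 in spans:
        cell = Cell(text)
        for row in range(r0, r1 + 1):
            for col in range(c0, c1 + 1):
                global_map[(row, col)] = cell
    return global_map


# "Peripheral" spans two columns in row 0; "UART" is an ordinary cell.
demo_map = build_global_map([
    ("Peripheral", 0, 0, 0, 1),
    ("UART", 1, 0, 1, 0),
])
```

Checking `demo_map[(0, 0)] is demo_map[(0, 1)]` then tells you whether two grid positions belong to one merged cell.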

I hope this will help you!

Also, I want to know: can pdfplumber extract vertical text in cells? When I try to parse this pdf, for example, I get a lot of output like this: s s e al-l st y Cr Y H P + d e e p S ull- F B S U and B) K y ( or m e M h s a Fl al ot T (all new lines were removed). I found a way to fix it, but it's not reliable; sometimes it gives something like this: SSPI PI) QTY (#Chip Select of Each, because there were 2 columns of text in one cell. 2018-10-04 12_20_20-kinetis l series - selector guide

I hope you can do something with this.

Oct 04 '18 09:10 REDxEYE

I've come up with somewhat working code that can handle merged cells by building a cell-to-cell map.

Very neat, and thank you for sharing! I think it will be helpful, indeed.

Also, I want to know, can pdfplumber extract vertical text in cells?

Unfortunately, pdfplumber does not currently have any special handling of vertical text. You may, however, be able to detect and parse vertical text from the raw page.chars objects.
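As a rough illustration of working from the raw chars, the sketch below groups rotated characters into columns and reads each column from bottom to top. It operates on plain dicts shaped like pdfplumber's char objects (using the text, x0, top and upright keys). The function name, the x_tolerance value, and the bottom-to-top reading order are assumptions; text rotated the other way would need the opposite sort order.

```python
def vertical_text_columns(chars, x_tolerance=2.0):
    """Return one string per column of non-upright characters.

    Assumes each char is a dict with "text", "x0", "top" and "upright"
    keys, and that vertical text reads from bottom to top.
    """
    # Keep only rotated characters, ordered left to right.
    rotated = sorted(
        (c for c in chars if not c.get("upright", True)),
        key=lambda c: c["x0"],
    )

    # Characters whose x0 is close to the column's first char share a column.
    columns = []
    for char in rotated:
        if columns and abs(char["x0"] - columns[-1][0]["x0"]) <= x_tolerance:
            columns[-1].append(char)
        else:
            columns.append([char])

    # Larger "top" means lower on the page, so sort descending to read upward.
    return [
        "".join(c["text"] for c in sorted(col, key=lambda c: -c["top"]))
        for col in columns
    ]
```

With pdfplumber you would pass something like the chars of a cropped cell region. Two columns of vertical text inside one cell, the case that broke the fix described above, come out as two separate strings as long as their x0 values differ by more than x_tolerance.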

Oct 11 '18 01:10 jsvine

Hi there. When I saw the picture @REDxEYE shared, I realized we might be going down the same road. Here is what I wrote for tomorrow's presentation; @REDxEYE, would you mind taking a look to see whether we share the same logic?

First, let's recall a one-cell table. Each cell can be represented as a rectangular area in the PDF plane, with a bbox defining its location. With this attribute, we can create a fake cell object with the same bbox but no information in it. Let's consider a table in a PDF that looks like the one below.

By collecting all x values and y values, we get two arrays:
X = [x1, x2, x3] with length n = 3
Y = [y1, y2, y3] with length p = 3

Then create a (n-1, p-1) zero matrix as a base matrix representing the table. Each location represents a cell, and we can reach a cell by its row and column index. In this case, the base matrix is [[0, 0], [0, 0]].

Next, create (n-1)*(p-1) fake cells using the values in X and Y. For instance:
Fake cell B (x1, y1, x2, y2)
Fake cell A (x1, y2, x2, y3)
Fake cell C (x2, y1, x3, y2)
Fake cell D (x2, y2, x3, y3)

Then assign each fake cell a copy of the base matrix with its corresponding location changed to 1:
Fake cell B (x1, y1, x2, y2): [[0, 0], [1, 0]]
Fake cell A (x1, y2, x2, y3): [[1, 0], [0, 0]]
Fake cell C (x2, y1, x3, y2): [[0, 0], [0, 1]]
Fake cell D (x2, y2, x3, y3): [[0, 1], [0, 0]]

As you may have noticed, there are only three cells in the table, but we created 4 fake cells to represent it. Therefore, in this step we need to put the fake cells into real cells. For instance, Real cell A (x1, y2, x3, y3) contains Fake cell A (x1, y2, x2, y3) with [[1, 0], [0, 0]] plus Fake cell D (x2, y2, x3, y3) with [[0, 1], [0, 0]]. Therefore, we can assign Real cell A (x1, y2, x3, y3) a matrix by adding up [[1, 0], [0, 0]] and [[0, 1], [0, 0]], which is [[1, 1], [0, 0]]. Now we know Real cell A is a merged cell.
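The fake-cell procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names build_masks and is_merged are made up, bboxes are (x0, y0, x1, y1) tuples in PDF coordinates with y growing upward, matrix row 0 is the top row of the table (which matches the matrices in the example), and coordinates are compared exactly, whereas real PDF data would likely need a tolerance.

```python
def build_masks(xs, ys, real_cells):
    """Give each real cell a 0/1 matrix marking the fake cells it contains.

    xs, ys: sorted lists of the distinct x and y edge coordinates.
    real_cells: dict mapping a cell name to its (x0, y0, x1, y1) bbox.
    """
    n_cols, n_rows = len(xs) - 1, len(ys) - 1
    masks = {
        name: [[0] * n_cols for _ in range(n_rows)] for name in real_cells
    }
    for col in range(n_cols):
        for row in range(n_rows):
            # Bbox of the fake cell at this grid position.
            fx0, fx1 = xs[col], xs[col + 1]
            fy0, fy1 = ys[row], ys[row + 1]
            for name, (x0, y0, x1, y1) in real_cells.items():
                inside = x0 <= fx0 and fx1 <= x1 and y0 <= fy0 and fy1 <= y1
                if inside:
                    # y grows upward, so flip to make matrix row 0 the top.
                    masks[name][n_rows - 1 - row][col] = 1
    return masks


def is_merged(mask):
    """A real cell is merged when it covers more than one fake cell."""
    return sum(sum(row) for row in mask) > 1


# The three-cell table from the example, with concrete numbers:
# x1, x2, x3 = 0, 10, 20 and y1, y2, y3 = 0, 5, 10.
masks = build_masks(
    [0, 10, 20],
    [0, 5, 10],
    {"A": (0, 5, 20, 10), "B": (0, 0, 10, 5), "C": (10, 0, 20, 5)},
)
```

Here masks["A"] covers both positions of the top row, so is_merged reports it as a merged cell, while B and C each cover a single position.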

Jul 30 '19 13:07 jackyjingyi

You can check my code here https://github.com/shuratn/py_pdf_stm/blob/master/TableExtractor.py

Jul 30 '19 17:07 REDxEYE

Hi, thanks for the reply. When I saw this part of your code, I thought we were on the same track! shake shake~, 👍 :) But in my thinking, there are 8 directions for point relationships, so 2 to the power of 8 situations in total. I encountered this when pdfminer returned a real Rectangle outside the table. Did you have the same issue?

Jul 31 '19 01:07 jackyjingyi

Hi, thank you very much for sharing your code. I'm using it for a university project and it works great. I find myself in a situation where I only need to understand which cells are joined horizontally. Is it possible to manage horizontally joined cells and insert a None instead of those joined vertically? Thanks for your help

Oct 13 '22 16:10 raffaele96x

It's been 4 years since I implemented it, and I don't really remember how I did that. I might invest some time and implement cleaner code to extract tables.

Oct 13 '22 16:10 REDxEYE

Thank you all the same. You did a great job!

Oct 13 '22 16:10 raffaele96x

Thank you for the code. I had the same situation. It's working great.

Nov 10 '22 06:11 giriannamalai
