pdfplumber
Handling merged cells (possible solution)
I've come up with somewhat working code that can handle merged cells by building a cell-to-cell map.
I can't describe well what I did and how, because of my limited English, but you can look at my code here. The main magic happens in TableExtractor.build_skeleton and Table.build_table.
This is the result from your library:
This is the skeleton map generated by my code:
And here is the result of my code analyzing lines and points after your library:
Table.global_map contains the mapped cells. For example, 0-0 and 0-1 point to the same cell, with the text Peripheral.
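To illustrate the idea of a cell-to-cell map, here is a minimal sketch. It is not the code from the linked repository: the function name build_cell_map and the row/col/rowspan/colspan keys are hypothetical and chosen only for this example. Every grid position covered by a merged cell points to the same cell object.

```python
def build_cell_map(cells):
    """Map every (row, col) grid position to the cell that covers it.

    `cells` is a list of dicts with hypothetical keys:
    'text', 'row', 'col', and optional 'rowspan' / 'colspan'.
    """
    cell_map = {}
    for cell in cells:
        for r in range(cell["row"], cell["row"] + cell.get("rowspan", 1)):
            for c in range(cell["col"], cell["col"] + cell.get("colspan", 1)):
                # A merged cell is stored once and referenced from each position.
                cell_map[(r, c)] = cell
    return cell_map


cells = [
    {"text": "Peripheral", "row": 0, "col": 0, "colspan": 2},
    {"text": "SPI", "row": 1, "col": 0},
    {"text": "QTY", "row": 1, "col": 1},
]
cell_map = build_cell_map(cells)
print(cell_map[(0, 0)]["text"], cell_map[(0, 1)]["text"])
```

Because positions 0-0 and 0-1 share one object, a consumer can tell a merged cell from two separate cells that happen to contain the same text by comparing identity rather than text.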
I hope this will help you!
Also, I want to know: can pdfplumber extract vertical text in cells?
When I try to parse this pdf, for example, I get a lot of output like s s e al-l st y Cr Y H P + d e e p S ull- F B S U and B) K y ( or m e M h s a Fl al ot T (all newlines were removed).
I found a way to fix them, but it's not reliable. Sometimes it gives something like SSPI PI) QTY (#Chip Select of Each, because there were 2 columns of text in one cell.
I hope you can do something with this.
I've come up with somewhat working code that can handle merged cells by building a cell-to-cell map.
Very neat, and thank you for sharing! I think it will be helpful, indeed.
Also, I want to know, can pdfplumber extract vertical text in cells?
Unfortunately, pdfplumber does not currently have any special handling of vertical text. You may, however, be able to detect and parse vertical text from the raw page.chars objects.
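As a rough sketch of that suggestion: pdfplumber char objects are dicts that include text, x0, top, and upright keys, so rotated characters can be picked out with the upright flag and regrouped by hand. The function name read_vertical_text and the x_tolerance parameter below are made up for this example, and the reading order (bottom to top within a column, columns left to right) is an assumption that fits text rotated 90 degrees counter-clockwise; other rotations would need a different sort.

```python
from collections import defaultdict


def read_vertical_text(chars, x_tolerance=2.0):
    """Rebuild vertical text lines from pdfplumber-style char dicts.

    Assumes the text reads bottom to top and that each vertical line
    of text shares roughly the same x0 (within x_tolerance).
    """
    # Keep only characters that pdfplumber marks as not upright.
    rotated = [c for c in chars if not c.get("upright", True)]

    # Bucket characters into columns by their rounded x0 position.
    columns = defaultdict(list)
    for c in rotated:
        columns[round(c["x0"] / x_tolerance)].append(c)

    lines = []
    for key in sorted(columns):  # columns left to right
        # Larger 'top' means lower on the page, so reverse to read upward.
        column = sorted(columns[key], key=lambda c: c["top"], reverse=True)
        lines.append("".join(c["text"] for c in column))
    return lines


# Synthetic chars standing in for page.chars (two vertical columns in one cell).
sample_chars = [
    {"text": "Y", "x0": 10.0, "top": 40.0, "upright": False},
    {"text": "Q", "x0": 10.0, "top": 60.0, "upright": False},
    {"text": "T", "x0": 10.0, "top": 50.0, "upright": False},
    {"text": "I", "x0": 22.0, "top": 40.0, "upright": False},
    {"text": "S", "x0": 22.0, "top": 60.0, "upright": False},
    {"text": "P", "x0": 22.0, "top": 50.0, "upright": False},
    {"text": "A", "x0": 10.0, "top": 10.0, "upright": True},
]
vertical_lines = read_vertical_text(sample_chars)
print(vertical_lines)
```

With a real document you would pass page.chars (optionally cropped to a cell's bbox first). Grouping by x0 keeps two columns of text in one cell from being interleaved, which is the failure described earlier in the thread.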
Hi there. When I saw the picture @REDxEYE shared, I realized we might be going down the same road. Here is what I wrote for tomorrow's presentation. @REDxEYE, would you mind taking a look to see whether we share the same logic?
"First, let's recall a one-cell table. As we know, each cell can be represented as a rectangular area on the PDF page, with a bbox that defines its location. Using this attribute, we can create a fake cell object with the same bbox but no information in it. Let's consider a table in a PDF that looks like the one below.
By collecting all x values and y values, we get two arrays: X = [x1, x2, x3] with length n = 3, and Y = [y1, y2, y3] with length p = 3.
Then create a (n-1, p-1) zero matrix as a base matrix that represents the table. Each location represents a cell, and we can reach a cell by its row and column index. In this case, the base matrix is [[0, 0], [0, 0]].
Next, create (n-1)*(p-1) fake cells using the values in X and Y. For instance:
Fake cell B (x1, y1, x2, y2)
Fake cell A (x1, y2, x2, y3)
Fake cell C (x2, y1, x3, y2)
Fake cell D (x2, y2, x3, y3)
Then assign each fake cell a copy of the base matrix, with its own location set to 1:
Fake cell B (x1, y1, x2, y2): [[0, 0], [1, 0]]
Fake cell A (x1, y2, x2, y3): [[1, 0], [0, 0]]
Fake cell C (x2, y1, x3, y2): [[0, 0], [0, 1]]
Fake cell D (x2, y2, x3, y3): [[0, 1], [0, 0]]
As you may have noticed, there are only three cells in the table, but we created 4 fake cells to represent it. Therefore, in this step we need to put the fake cells into the real cells. For instance, Real cell A (x1, y2, x3, y3) contains Fake cell A (x1, y2, x2, y3): [[1, 0], [0, 0]] plus Fake cell D (x2, y2, x3, y3): [[0, 1], [0, 0]].
Therefore, we can assign Real cell A (x1, y2, x3, y3) a matrix by adding up [[1, 0], [0, 0]] and [[0, 1], [0, 0]], which gives [[1, 1], [0, 0]]. Since this matrix has more than one 1, we now know Real cell A is a merged cell."
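The fake-cell procedure described above can be sketched in a few lines. This is only an illustration of the description, not code from either poster: the names span_matrices and is_merged are made up, bboxes are assumed to be (x0, y0, x1, y1) tuples with y increasing upward as in the example, and row 0 of each matrix is the top band so the output matches the matrices listed above.

```python
def span_matrices(real_cells):
    """For each real cell, mark which fake (grid) cells it covers.

    `real_cells` maps a cell name to its bbox (x0, y0, x1, y1).
    Returns a dict of name -> 0/1 matrix; row 0 is the top band.
    """
    boxes = real_cells.values()
    xs = sorted({x for x0, _, x1, _ in boxes for x in (x0, x1)})
    ys = sorted({y for _, y0, _, y1 in boxes for y in (y0, y1)})
    n_cols, n_rows = len(xs) - 1, len(ys) - 1

    result = {}
    for name, (x0, y0, x1, y1) in real_cells.items():
        matrix = [[0] * n_cols for _ in range(n_rows)]
        for r in range(n_rows):
            # Band r counts down from the largest y value (top of the table).
            fy0, fy1 = ys[n_rows - 1 - r], ys[n_rows - r]
            for c in range(n_cols):
                fx0, fx1 = xs[c], xs[c + 1]
                # The fake cell belongs to this real cell if it lies inside it.
                if x0 <= fx0 and fx1 <= x1 and y0 <= fy0 and fy1 <= y1:
                    matrix[r][c] = 1
        result[name] = matrix
    return result


def is_merged(matrix):
    """A real cell is merged when it covers more than one fake cell."""
    return sum(sum(row) for row in matrix) > 1


# The three-cell table from the description, with x1, x2, x3 = 0, 10, 20
# and y1, y2, y3 = 0, 5, 10.
table = {
    "A": (0, 5, 20, 10),   # spans both columns in the top row
    "B": (0, 0, 10, 5),
    "C": (10, 0, 20, 5),
}
matrices = span_matrices(table)
print(matrices["A"], is_merged(matrices["A"]))
```

Marking the covered positions directly gives the same result as summing the per-fake-cell matrices. Real PDF coordinates are floats, so in practice the edge values would likely need rounding or a tolerance before they are collected into X and Y.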
You can check my code here https://github.com/shuratn/py_pdf_stm/blob/master/TableExtractor.py
Hi, thanks for the reply. When I saw this part of your code, I thought we are on the same track! shake shake~ 👍 :) But in my thinking, there are 8 directions for point relationships, so 2 to the power of 8 situations in total. I encountered this when pdfminer returned a real Rectangle outside the table. Did you have the same issue?
Hi, thank you very much for sharing your code. I'm using it for a university project and it works great. I find myself in a situation where I only need to understand which cells are joined horizontally. Is it possible to manage horizontally joined cells and insert a None instead of those joined vertically? Thanks for your help
It's been 4 years since I implemented it, and I don't really remember how I did that. I might invest some time and implement cleaner code to extract tables.
Thank you all the same. You did a great job!
Thank you for the code. I had the same situation. It's working great.