camelot icon indicating copy to clipboard operation
camelot copied to clipboard

[MRG] Utils: optimise get_page_layout

Open karlowich opened this issue 2 years ago • 4 comments

Since the existing code overwrites layout and dim in each iteration, it is much more efficient to simply return the layout and dim of the first page.

I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds.

Signed-off-by: Karl Bonde Torp [email protected]

karlowich avatar Apr 29 '22 08:04 karlowich

Granted the code iterates over all applicable pages and returns a single layout, dim I suspect there is a not so subtle defect.

It assumes that all pages are the same.

Is this a valid assumption? Can a single pdf contain multiple dimensions?

An example might be a merged pdf containing pdfs in Letter-size and A4-size. (I'm not a pdf expert and don't know the details in such a document, but believe it needs to be considered.)

A possible best solution would be to assume all pages are the same, and provide an option indicating that the pdf contains dissimilar page sizes.

This facilitates the speed up of the common case, yet provides facility for the unique case where paying the cost of per page examination makes or breaks the ability to recognize different pages.

ottohirr avatar Sep 06 '22 16:09 ottohirr

You make a good point.

I wonder what the width and height will be in the case of multiple dimensions.

Regardless of single or multiple dimensions, overwriting variables in a loop doesn't make a lot of sense. Perhaps only the interpreter.process_page(page) should be inside the loop, and the rest should be called when it is done.

Besides, it appears that this project is without a maintainer, and whatever conclusion we reach will be irrelevant.

karlowich avatar Sep 08 '22 11:09 karlowich

Hey!

As camelot is dead, we try to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

MartinThoma avatar Feb 25 '24 11:02 MartinThoma

Hi @MartinThoma, I have opened a PR against the fork:

https://github.com/py-pdf/pypdf_table_extraction/pull/5

karlowich avatar Feb 27 '24 11:02 karlowich