camelot
camelot copied to clipboard
[MRG] Utils: optimise get_page_layout
Since the existing code overwrites layout
and dim
in each iteration,
it is much more efficient to simply return the layout
and dim
of the
first page.
I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds.
Signed-off-by: Karl Bonde Torp [email protected]
Granted the code iterates over all applicable pages and returns a single layout, dim
I suspect there is a not so subtle defect.
It assumes that all pages are the same.
Is this a valid assumption? Can a single pdf contain multiple dimensions?
An example might be a merged pdf containing pdfs in Letter-size and A4-size. (I'm not a pdf expert and don't know the details in such a document, but believe it needs to be considered.)
A possible best solution would be to assume all pages are the same, and provide an option indicating that the pdf contains dissimilar page sizes.
This facilitates the speed up of the common case, yet provides facility for the unique case where paying the cost of per page examination makes or breaks the ability to recognize different pages.
You make a good point.
I wonder what the width
and height
will be in the case of multiple dimensions.
Regardless of single or multiple dimensions, overwriting variables in a loop doesn't make a lot of sense.
Perhaps only the interpreter.process_page(page)
should be inside the loop, and the rest should be called when it is done.
Besides, it appears that this project is without a maintainer, and whatever conclusion we reach will be irrelevant.
Hey!
As camelot is dead, we try to build a maintained fork at pypdf_table_extraction
.
Do you want to open the PR against that branch so that we can merge your improvement?
Hi @MartinThoma, I have opened a PR against the fork:
https://github.com/py-pdf/pypdf_table_extraction/pull/5