Use multiprocessing to process PDF pages in parallel

Open vinayak-mehta opened this issue 5 years ago • 17 comments

>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)

We could try and use all cores present on the machine using multiprocessing. More ideas are welcome.

vinayak-mehta avatar Jul 05 '19 22:07 vinayak-mehta

Hi @vinayak-mehta ,

I had also thought of implementing this. My suggestions for asynchronous processing of pages are dramatiq or celery.

satheeshkatipomu avatar Sep 20 '19 09:09 satheeshkatipomu

I'm doing this with dask, though I chose it mostly out of habit.

jontis avatar Sep 21 '19 11:09 jontis

Has there been any progress on this? I have a file with only one page. The page has a table (25 rows x 13 columns). The read_pdf function takes 10 seconds, and after that to_excel takes only 100-150 ms. I think 10 seconds is too long. Am I wrong?

selcukusta avatar Nov 07 '19 11:11 selcukusta

@selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

But does anyone have a solution for multiple pages in parallel?

NixBiks avatar Nov 11 '19 12:11 NixBiks

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

Yes!

> But does anyone have a solution for multiple pages in parallel?

Using multiprocessing, we should be able to distribute multiple pages on all cores, processing them in parallel.

vinayak-mehta avatar Nov 11 '19 15:11 vinayak-mehta
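
A minimal sketch of the idea above, assuming the page count is already known and that each worker process can open the PDF on its own. The helper names `chunk_pages`, `extract_chunk`, and `read_pdf_parallel` are hypothetical and not part of camelot; only `camelot.read_pdf` with its `pages` argument and the `.df` attribute of each table are taken from the library. This is not the implementation from the PR mentioned later in the thread.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(n_pages, n_chunks):
    """Split pages 1..n_pages into at most n_chunks contiguous ranges like '1-3'."""
    if n_pages < 1:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    chunks, start = [], 1
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        end = start + size - 1 + (1 if i < extra else 0)
        chunks.append(f"{start}-{end}")
        start = end + 1
    return chunks


def extract_chunk(job):
    """Worker (hypothetical helper): extract tables from one page range."""
    path, pages, kwargs = job
    import camelot  # imported inside the worker so each process loads it itself

    tables = camelot.read_pdf(path, pages=pages, **kwargs)
    # Return plain DataFrames, which are sent back to the parent process.
    return [table.df for table in tables]


def read_pdf_parallel(path, n_pages, workers=4, **kwargs):
    """Hypothetical wrapper: one camelot.read_pdf call per page range, one range per worker."""
    jobs = [(path, pages, kwargs) for pages in chunk_pages(n_pages, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in job order, so page order is preserved.
        results = pool.map(extract_chunk, jobs)
        return [df for chunk in results for df in chunk]
```

A caller would run something like `read_pdf_parallel('filename.pdf', n_pages=40, workers=4)` from inside an `if __name__ == "__main__":` guard. Processes rather than threads are used here on the assumption that the underlying tools may not be thread-safe.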

I get this though

objc[53475]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Oh; what is the difference between https://github.com/atlanhq/camelot and https://github.com/camelot-dev/camelot ? Didn't notice two repos before now...

NixBiks avatar Nov 11 '19 15:11 NixBiks
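
The objc message quoted above is the macOS fork-safety check, which can fire when a process calls fork() after Objective-C code has been initialized. One commonly used way around it is to start workers with the "spawn" start method instead of fork. The sketch below assumes that approach; `SPAWN`, `spawn_executor`, and `map_with_spawn` are hypothetical names, and whether this resolves the crash in a given setup is untested here.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# A context whose workers are started as fresh interpreters ("spawn")
# instead of being forked from the parent process.
SPAWN = mp.get_context("spawn")


def spawn_executor(workers=2):
    """Hypothetical helper: build a process pool that uses the spawn context."""
    return ProcessPoolExecutor(max_workers=workers, mp_context=SPAWN)


def map_with_spawn(func, items, workers=2):
    """Apply func to each item in spawned worker processes.

    func must be a picklable, top-level function, and the calling script
    needs an `if __name__ == "__main__":` guard, because spawned workers
    re-import the main module.
    """
    with spawn_executor(workers) as pool:
        return list(pool.map(func, items))
```

Another workaround often reported for this message is setting the environment variable `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` before starting Python, which disables the check rather than avoiding the fork.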

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
>
> But does anyone have a solution for multiple pages in parallel?

Yeah, I know. Actually, my question is related to that, but my issue was closed with a reference to this one.

selcukusta avatar Nov 12 '19 10:11 selcukusta

Does anyone have an update? I've tried inheriting from PageHandler to make page processing multithreaded / multicore, and I've also tried processing multiple PDFs in multiple threads, but I'm running into a Ghostscript error (it seems like it's not thread-safe?).

rawsh-bt avatar Jun 26 '20 17:06 rawsh-bt

I implemented a multi-threading layer on top of camelot.read_pdf using the multiprocessing library. I ran into a couple of pitfalls while doing it, so I can help with this if you'd like.

phoewass avatar Sep 02 '20 00:09 phoewass

@phoewass That would be awesome if you're still interested!

vinayak-mehta avatar Oct 12 '20 15:10 vinayak-mehta

Can anyone tell me how to use multiprocessing in camelot? Or is this issue still in progress?

RickyGunawan09 avatar Apr 28 '21 04:04 RickyGunawan09

Hi all. Sorry it took me a while to publish the PR even though the code was already available. The PR is now up for review, and I'm looking forward to your feedback.

phoewass avatar May 01 '21 15:05 phoewass

👀

vinayak-mehta avatar Jun 14 '21 20:06 vinayak-mehta