camelot
Use multiprocessing to process PDF pages in parallel
>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)
We could try to use all cores available on the machine via multiprocessing. More ideas are welcome.
Hi @vinayak-mehta,
I had also thought of implementing this. My suggestions for asynchronous processing of pages are dramatiq or celery.
I'm doing this with dask, but I chose it mostly out of habit.
Is there any improvement on this? I have a file with only one page. The page has a single table (25 rows x 13 columns). The read_pdf function takes 10 seconds; after that, to_excel takes only 100-150 ms. I think 10 seconds is too long, am I wrong?
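To confirm where the time goes, the two calls can be timed separately. This is only a sketch: `timed` and `profile_extraction` are hypothetical helper names, and it assumes camelot plus an Excel writer such as openpyxl are installed and that the page yields at least one table.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def profile_extraction(filename, out_path):
    """Time read_pdf and to_excel separately for the first page of a PDF."""
    import camelot  # imported here so the timing helper works without camelot

    timings = {}
    with timed("read_pdf", timings):
        tables = camelot.read_pdf(filename, pages="1")
    with timed("to_excel", timings):
        # Assumes at least one table was found on the page.
        tables[0].to_excel(out_path)
    return timings
```

Calling `profile_extraction('filename.pdf', 'out.xlsx')` returns a dict of seconds per step, which makes it easier to report numbers like the ones above.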
@selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
But does anyone have a solution for multiple pages in parallel?
> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
Yes!
> But does anyone have a solution for multiple pages in parallel?
Using multiprocessing, we should be able to distribute multiple pages on all cores, processing them in parallel.
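A minimal sketch of that idea, not camelot's own implementation: split the page numbers into one chunk per core and run `camelot.read_pdf` on each chunk in a separate process. The helper names (`split_pages`, `extract_chunk`, `read_pdf_parallel`) are hypothetical, and the caller is assumed to already know which page numbers the PDF has.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_pages(page_numbers, n_chunks):
    """Split page numbers into at most n_chunks comma-separated page strings."""
    page_numbers = list(page_numbers)
    if not page_numbers:
        return []
    n_chunks = max(1, min(n_chunks, len(page_numbers)))
    size, extra = divmod(len(page_numbers), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(",".join(str(p) for p in page_numbers[start:end]))
        start = end
    return chunks


def extract_chunk(job):
    """Worker: run camelot on one chunk of pages and return its DataFrames."""
    import camelot  # imported inside the worker process

    filename, pages = job
    tables = camelot.read_pdf(filename, pages=pages)
    return [table.df for table in tables]


def read_pdf_parallel(filename, page_numbers, workers=None):
    """Extract tables from the given pages using one process per chunk."""
    page_numbers = list(page_numbers)
    if not page_numbers:
        return []
    workers = workers or os.cpu_count() or 1
    jobs = [(filename, pages) for pages in split_pages(page_numbers, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in job order, so tables stay in page order.
        chunks = list(pool.map(extract_chunk, jobs))
    return [df for chunk in chunks for df in chunk]
```

For example, `read_pdf_parallel('filename.pdf', range(1, 41))` would return a flat list of DataFrames. The call should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely.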
I get this error, though:
objc[53475]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
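This message comes from macOS when a forked child process touches the Objective-C runtime. One commonly used workaround is to start workers with the "spawn" method instead of "fork"; another is setting the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before launching Python. A sketch of the first option follows, with hypothetical helper names and no guarantee that it resolves every camelot setup:

```python
import multiprocessing as mp


def spawn_context():
    """Return a multiprocessing context whose workers start fresh instead of forking."""
    return mp.get_context("spawn")


def map_in_spawned_workers(func, items, workers=2):
    """Apply func to each item in worker processes started with 'spawn'.

    func must be defined at module top level so the workers can import it.
    """
    ctx = spawn_context()
    with ctx.Pool(processes=workers) as pool:
        return pool.map(func, items)
```

As with any spawned workers, the call to `map_in_spawned_workers` belongs under an `if __name__ == "__main__":` guard, because each worker re-imports the main module.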
Oh, what is the difference between https://github.com/atlanhq/camelot and https://github.com/camelot-dev/camelot ? I hadn't noticed there were two repos until now...
> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
> But does anyone have a solution for multiple pages in parallel?
Yeah, I know. My question is related to that, though: the other issue was closed with a reference to this one.
Does anyone have an update? I've tried subclassing PageHandler to process pages across multiple threads / cores, and also multithreading across multiple PDFs, but I'm running into a Ghostscript error (it seems it's not thread-safe?).
I implemented a parallel-processing layer on top of camelot.read_pdf using the multiprocessing library. I hit a couple of pitfalls while doing it, so I can help with this if that's useful.
@phoewass That would be awesome if you're still interested!
Can anyone tell me how to use multiprocessing with camelot? Or is this issue still in progress?
Hi all. Sorry it took me a while to publish the PR even though the code was already available. Now that the PR is up for review, I'm looking forward to your feedback.
👀