camelot
Allow read_pdf to accept a file-like object
In our use case we have PDF data streamed in memory from an external service; in order to process it using camelot we need to save that data out to a file and then pass the filename over. It would be great to be able to just send a file-like object through the interface instead, as this would save us from needing to write temporary files only to read them back in. I do not think there is a workaround for this at the moment, but if there is one, any information would be greatly appreciated.
I do not know if I will have time to work on a PR soon, but does this sound like a reasonable feature to add?
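For reference, the temporary-file round trip described above can be sketched roughly as follows. This is only an illustration of the current workaround, not part of camelot: `read_pdf_from_bytes` is a hypothetical helper name, and `reader` stands for any callable that expects a filename, such as `camelot.read_pdf`.

```python
import os
import tempfile


def read_pdf_from_bytes(pdf_bytes, reader, **kwargs):
    """Write pdf_bytes to a temporary file and call reader(path, **kwargs).

    Hypothetical helper: `reader` is any callable that takes a filename,
    for example camelot.read_pdf.
    """
    # delete=False so the file can be reopened by name on every platform
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # make sure the data is on disk before the reader opens it
        return reader(tmp.name, **kwargs)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)  # clean up the temporary file
```

With camelot installed, a call would look something like `read_pdf_from_bytes(data, camelot.read_pdf, pages="1")`. The extra write and read is exactly the overhead this issue asks to avoid.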
In this other repository (https://github.com/atlanhq/camelot) (I assume the original one?) there are already two pull requests pending and waiting to be accepted for this issue:
- https://github.com/atlanhq/camelot/pull/376
- https://github.com/atlanhq/camelot/pull/331
Maybe we can do this quickly with those ;). I think this is really a feature that a lot of people would like to have ...
Thanks for pointing that out! Right now #13 is taking up a lot of my time, but I will try to get to this over the weekend.
For people whose main problem is wanting to keep the file "in-memory", for example as a spooled temporary file, a short workaround could be the following:
Use this library: https://github.com/mbello/memory-tempfile to create a file on a tmpfs in memory. This solution only works on Linux, though ... Additionally, it's difficult to do this in Docker images or on Kubernetes.
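A similar effect can be sketched with only the standard library by placing the temporary file in a tmpfs directory. This does not use memory-tempfile; it assumes `/dev/shm` is a tmpfs mount, which is common on Linux but not guaranteed, and `tmpfs_named_file` is a hypothetical helper name.

```python
import os
import tempfile


def tmpfs_named_file(suffix=".pdf", shm_dir="/dev/shm"):
    """Return a NamedTemporaryFile, placed in shm_dir when it is usable.

    Assumption: shm_dir is backed by tmpfs (RAM). If it is missing or not
    writable, fall back to the default temporary directory.
    """
    use_shm = os.path.isdir(shm_dir) and os.access(shm_dir, os.W_OK)
    return tempfile.NamedTemporaryFile(
        suffix=suffix, dir=shm_dir if use_shm else None
    )


# Usage sketch (Linux only; needs camelot installed):
# with tmpfs_named_file() as tmp:
#     tmp.write(pdf_bytes)
#     tmp.flush()
#     tables = camelot.read_pdf(tmp.name)
```

The file is still addressed by a path, so camelot can open it, and it is removed when the `with` block exits.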
@vinayak-mehta just saw your comment. Looking forward to this! If you need any help (testing, review...) just contact me ;) although I am not that deep into the library ...
Thanks for the suggestion, and for offering your help! I will try to get to the PRs by the weekend and will definitely comment here if I need help :)
I mentioned another use case for this in https://github.com/atlanhq/camelot/pull/189, where reading from a file-like object would come in handy when more advanced authentication is required for websites (e.g. SharePoint), which means pulling the object using a library like requests.
@pilotjoe Thank you for your comment describing the use-case.
Last week, I ended up spending a lot of time on #13. Will get to this soon.
Hey @vinayak-mehta, just checking in: did you get around to doing this?
Would love this feature to be implemented. The use case is an AWS Lambda function that has read a PDF from S3 and processed it with regex to find the relevant pages. We then wish to pass the relevant pages as bytes to a table extraction package, ideally without having to write to and read from a file again in the Lambda.