
Allow read_pdf to accept a file-like object

Open Lnk2past opened this issue 5 years ago • 12 comments

In our use case we have PDF data streamed in memory from an external service; in order to process it using camelot, we need to save that data out to a file and then pass the filename over. It would be great to be able to just send a file-like object through the interface instead, as this would save us from writing temporary files only to read them back in. I do not think there is a workaround for this at the moment, but if there is one, any information would be greatly appreciated.

I do not know if I will have time to work on a PR any time soon, but does this sound like a reasonable feature to add?
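For reference, the temporary-file round trip described above can be sketched roughly as follows. This is only a minimal illustration: `as_temp_pdf` is a hypothetical helper name, not part of camelot, and it assumes the PDF is already available as `bytes`.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temp_pdf(data: bytes, directory=None):
    """Write in-memory PDF bytes to a temporary file and yield its path.

    Hypothetical helper for illustration; not part of camelot.
    """
    # delete=False so the file can be reopened by name on every platform
    # (on Windows an open NamedTemporaryFile cannot be opened a second time).
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", dir=directory, delete=False)
    try:
        tmp.write(data)
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)


# Usage (requires camelot to be installed):
# import camelot
# with as_temp_pdf(pdf_bytes) as path:
#     tables = camelot.read_pdf(path)
```

The file is removed as soon as the `with` block exits, so any tables have to be extracted inside it.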

Lnk2past avatar Dec 06 '19 15:12 Lnk2past

In this other repository (https://github.com/atlanhq/camelot) (I assume the original one?) there are already two merge requests pending and waiting to be accepted for this issue:

  • https://github.com/atlanhq/camelot/pull/376
  • https://github.com/atlanhq/camelot/pull/331

Maybe we can do this quickly with that ;). I think this is really a feature that a lot of people would like to have ...

yeus avatar Oct 05 '20 21:10 yeus

Thanks for pointing that out! Right now #13 is taking up a lot of my time, but I will try to get to this over the weekend.

vinayak-mehta avatar Oct 05 '20 22:10 vinayak-mehta

For people whose main problem is keeping the file "in memory", for example as a spooled temporary file, a short workaround could be the following:

Use this library: https://github.com/mbello/memory-tempfile to create a file on a tmpfs in memory. This solution only works on Linux, though ... Additionally, it's difficult to do this in Docker images or on Kubernetes.
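A rough standard-library variant of the same idea, for anyone who cannot add a dependency, might look like the sketch below. It does not use memory-tempfile. It assumes that `/dev/shm` (or `/run/shm`) is a tmpfs mount, which is common on Linux but is not checked here; `memory_backed_dir` and `write_pdf_to_memory_dir` are hypothetical names.

```python
import os
import tempfile


def memory_backed_dir(candidates=("/dev/shm", "/run/shm")):
    """Return the first writable candidate directory, or None if there is none.

    Assumption: the candidates are tmpfs mounts; this is NOT verified.
    """
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None


def write_pdf_to_memory_dir(data: bytes) -> str:
    """Write PDF bytes to a temp file, preferring a RAM-backed directory.

    Returns the file path; the caller is responsible for deleting it.
    """
    # dir=None makes tempfile fall back to the default temp directory,
    # which is usually on disk.
    fd, path = tempfile.mkstemp(suffix=".pdf", dir=memory_backed_dir())
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

On systems without a writable candidate directory this silently falls back to an ordinary on-disk temp file, so it is a best-effort approach rather than a guarantee.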

yeus avatar Oct 05 '20 22:10 yeus

@vinayak-mehta just saw your comment. Looking forward to this! If you need any help (testing, review...) just contact me ;) although I am not that deep into the library ...

yeus avatar Oct 05 '20 22:10 yeus

Thanks for the suggestion, and for offering your help! I will try to get to the PRs by the weekend and will definitely comment here if I need help :)

vinayak-mehta avatar Oct 05 '20 22:10 vinayak-mehta

I mentioned another use case for this in https://github.com/atlanhq/camelot/pull/189, where reading from a file-like object would come in handy when a website (e.g. SharePoint) requires more advanced authentication and the object has to be pulled using a library like requests.
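To illustrate that use case, a minimal sketch of pulling a PDF into memory through an authenticated session could look like this. `fetch_pdf` is a hypothetical helper; it only relies on the session exposing `get()` and on the response exposing `raise_for_status()` and `.content`, as `requests` does. The URL and credentials in the usage comment are placeholders.

```python
import io


def fetch_pdf(session, url: str) -> io.BytesIO:
    """Download a PDF through an already-authenticated session into memory.

    Hypothetical helper for illustration; `session` is expected to behave
    like a requests.Session.
    """
    response = session.get(url)
    # Fail loudly on 401/403/404 instead of handing an error page to the parser.
    response.raise_for_status()
    return io.BytesIO(response.content)


# Usage (requires requests; URL and credentials are placeholders, and real
# sites may need a different authentication scheme):
# import requests
# session = requests.Session()
# session.auth = ("user", "password")
# buffer = fetch_pdf(session, "https://example.com/report.pdf")
```

The resulting `BytesIO` is exactly the kind of file-like object this issue asks `read_pdf` to accept; until then it still has to be written out to a file first.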

pilotjoe avatar Oct 08 '20 19:10 pilotjoe

@pilotjoe Thank you for your comment describing the use-case.

Last week, I ended up spending a lot of time on #13. Will get to this soon.

vinayak-mehta avatar Oct 12 '20 16:10 vinayak-mehta

Hey @vinayak-mehta, just checking in to see if you got around to doing this?

yash12392 avatar Mar 09 '21 15:03 yash12392

Would love this feature to be implemented. Our use case is an AWS Lambda function that reads a PDF from S3 and processes it with regex to find the relevant pages. We then want to pass those pages as bytes to a table extraction package, ideally without having to write to and read from a file again in the Lambda.

HeskethGD avatar Jun 20 '22 12:06 HeskethGD