python-pdftables-api

API not working for files over 100KB

Open jmbanda opened this issue 6 months ago • 4 comments

Greetings,

I am submitting a large set of files, and only files under 100KB are getting processed. All others are skipped without raising an error or producing any error message. I have adjusted the timeout parameter, and this does not fix the issue.

Thanks!

jmbanda, Dec 20 '23 09:12

@jmbanda: sorry for the delayed reply, you caught us over the holiday period.

Is this still an issue? If so, is it possible to provide us with the example code and PDFs to try and reproduce the error?

StevenMaude, Jan 08 '24 17:01

Greetings, yes, this continues to be an issue. I can't provide the PDF as it is private, but any PDF above 100KB was failing with the following code:

    import pdftables_api

    c = pdftables_api.Client('my-api-key', timeout=(60, 3600))
    c.xlsx('input.pdf', 'output.xlsx')

The same happens with or without the timeout parameter. We still have plenty of pages left in our paid bundle, so that is not the issue. No error is thrown; the documents are simply skipped. If we submit the same document manually through the web UI, it converts correctly.
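To make the silent skips visible, a batch loop along the following lines might help. This is only a minimal sketch: `convert_batch` is a hypothetical helper, not part of pdftables_api, and it assumes the conversion call takes an input path and an output path in the way `c.xlsx` is used above. It records each file's size and whether an output file actually appeared, so that skipped files show up in the results instead of disappearing.

```python
import os


def convert_batch(pdf_paths, convert):
    """Run convert(pdf_path, out_path) per file and report what happened.

    `convert` is any callable taking (input_path, output_path); it is a
    stand-in for something like Client.xlsx. Hypothetical helper, not part
    of pdftables_api.
    """
    results = []
    for pdf_path in pdf_paths:
        out_path = os.path.splitext(pdf_path)[0] + ".xlsx"
        size_kb = os.path.getsize(pdf_path) / 1024
        try:
            convert(pdf_path, out_path)
        except Exception as exc:  # broad on purpose: surface anything raised
            results.append((pdf_path, size_kb, f"error: {exc!r}"))
            continue
        # No exception does not guarantee a result, so check the output file.
        # Note: a stale output file from an earlier run would also count as "ok".
        if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
            status = "ok"
        else:
            status = "no output written"
        results.append((pdf_path, size_kb, status))
    return results
```

With the client from the snippet above, this could be called as `convert_batch(list_of_pdf_paths, c.xlsx)` and the returned tuples printed, which would show whether the "no output written" cases really line up with the 100KB boundary.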

jmbanda, Jan 08 '24 17:01

Thanks; we'll add it to our issue queue and take a look, then report back (it may be a few days).

StevenMaude, Jan 08 '24 18:01

Just to follow up, I've tested the code here on a fresh Ubuntu 22.04 virtual machine and can't reproduce the issue. This was using Python 3.10 that came bundled with the operating system.

I did the following:

  1. Created a virtualenv with python3 -m venv api

  2. Activated the virtualenv with source api/bin/activate

  3. Ran pip install git+https://github.com/pdftables/python-pdftables-api.git to install the API code.

  4. Converted a test PDF named input.pdf of size 360 KB with the following code (edited to include my actual API key):

    import pdftables_api
    
    c = pdftables_api.Client('my-api-key', timeout=(60, 3600))
    c.xlsx('input.pdf', 'output.xlsx')
    

This produced an output Excel file named output.xlsx.

If you can give any more details about the environment in which the code was failing, we can try and reproduce further. It's tricky to fix without encountering the problem, unfortunately.
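For the environment details, something like the following could be run in the failing environment and its output pasted into the issue. It is a rough sketch using only the standard library; `environment_report` is a hypothetical helper, and the distribution names `pdftables-api` and `requests` are assumptions about what is installed, so they may need adjusting.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("pdftables-api", "requests")):
    """Collect Python, OS, and package versions as plain text.

    The default package names are assumptions; pass different distribution
    names if the installed ones differ.
    """
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

Comparing this output with the Ubuntu 22.04 and Python 3.10 setup described above may narrow down where the two environments differ.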

StevenMaude, Jan 22 '24 10:01