NotImplementedError: File format not supported

Open kushalmraut opened this issue 1 year ago • 4 comments

For some PDF links I am getting this error: NotImplementedError: File format not supported

<ipython-input-11-0615a449639b> in <cell line: 1>()
----> 1 tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')

2 frames
/usr/local/lib/python3.10/dist-packages/camelot/utils.py in download_url(url)
     87         content_type = obj.info().get_content_type()
     88         if content_type != "application/pdf":
---> 89             raise NotImplementedError("File format not supported")
     90         f.write(obj.read())
     91     filepath = os.path.join(os.path.dirname(f.name), filename)

NotImplementedError: File format not supported

Steps to reproduce the bug

Run the code below to reproduce the error.

tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')

Expected behavior

A list of tables was expected.

PDF

https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf

Screenshots

[image]

Environment

Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.8.2

also tried Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.9.0

and Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.11.0

kushalmraut avatar Aug 07 '24 09:08 kushalmraut

Hey!

As discussed in https://github.com/camelot-dev/camelot/issues/343, we are trying to build a maintained fork at pypdf_table_extraction.

Can you check with the latest code over there whether the issue still exists? If so, please open an issue there.

bosd avatar Aug 07 '24 10:08 bosd

@MartinThoma @vinayak-mehta @bosd I am facing the same error as Kushal. Expected output: a list of tables. Actual output since this week: "Attribute Error: File Format not supported". Could you please let me know if a fix has been deployed on the forked branch? This was working a week ago, and my particular use case requires the lattice boundary detection provided exclusively in camelot-py[cv].

jatinchhabriya avatar Aug 20 '24 07:08 jatinchhabriya

> Could you please let me know if a fix has been deployed on the forked branch,

I assume the fork is ok. The tests are passing there.

Please test your use case with a fresh pip install of pypdf_table_extraction.

If that doesn't work, please install from source from the main branch.

If you still encounter an error, please open an issue on the new repo.

bosd avatar Aug 20 '24 11:08 bosd

@MartinThoma @bosd @vinayak-mehta I tried installing the main branch of the fork as you suggested. Could you please add a usage example showing how camelot has to be imported after installing pypdf-table-extraction from the GitHub main branch? I have also opened an issue on the fork; please tag the active maintainers: https://github.com/py-pdf/pypdf_table_extraction/issues/63

jatinchhabriya avatar Aug 22 '24 06:08 jatinchhabriya

I just looked at this. The behavior persists in the current release (v1.0.0) from this repo, but I believe the package is functioning as intended.

For the link first referenced in this issue, the content type reported for the HTTP response from the hosting server is text/plain. camelot only processes the response if the content type is exactly application/pdf; otherwise it raises the NotImplementedError, as observed. The Content-Type header is part of the server's HTTP response, so there is nothing camelot can do about it. If this issue appeared recently and documents from this source were previously processed by camelot without problems, then something most likely changed on the server side.

Why was the content type text/plain when the file looks like a PDF? This can happen for several reasons: (1) that is the content type listed in the HTTP response header from the server, (2) the Content-Type field is invalid, so get_content_type() falls back to text/plain, or (3) there is no Content-Type field in the HTTP response, so get_content_type() falls back to text/plain. See here for details. In the case of the link above, there is no Content-Type header field in the HTTP response.
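The fallback in cases (2) and (3) can be reproduced without any network access. The sketch below assumes only the standard library: `email.message.Message` is the base class of the header object that `obj.info()` returns in the traceback above, and `reported_content_type` is a hypothetical helper name used for illustration.

```python
from email.message import Message


def reported_content_type(header_value=None):
    """Return what get_content_type() reports for a given Content-Type value.

    Passing None simulates a response with no Content-Type header at all.
    """
    msg = Message()
    if header_value is not None:
        msg["Content-Type"] = header_value
    return msg.get_content_type()


# Case (1): the server sends a valid type, which is returned as-is.
print(reported_content_type("application/pdf"))   # application/pdf
# Case (2): an invalid value falls back to text/plain.
print(reported_content_type("not-a-mime-type"))   # text/plain
# Case (3): a missing header also falls back to text/plain.
print(reported_content_type())                    # text/plain
```

Only the first case passes camelot's `content_type != "application/pdf"` check shown in the traceback.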

See this for more on possible content-type tags.

One option for users is to first download the file at the link and then process it with camelot as you would a local PDF file. This works if you are confident your links always point to PDFs, rather than relying on the Content-Type header in the HTTP response as camelot does.
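A minimal sketch of that workaround, assuming the standard library for the download and `camelot.read_pdf` for the local file. The helper names (`looks_like_pdf`, `download_pdf`, `read_tables_from_url`) and the `%PDF-` signature check are illustrative assumptions, not part of camelot.

```python
import tempfile
import urllib.request
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files normally start with this signature


def looks_like_pdf(data):
    """Check the file signature instead of trusting the Content-Type header."""
    return data[:1024].lstrip().startswith(PDF_MAGIC)


def download_pdf(url, dest_dir=None):
    """Download url to a local file and return its path (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} does not look like a PDF")
    target_dir = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp())
    path = target_dir / "downloaded.pdf"
    path.write_bytes(data)
    return path


def read_tables_from_url(url, **kwargs):
    """Download first, then hand the local path to camelot."""
    import camelot  # imported lazily so the helpers above work without camelot

    return camelot.read_pdf(str(download_pdf(url)), **kwargs)
```

With these helpers, the call from the original report would become `read_tables_from_url(url, pages='1', flavor='lattice')`. Checking the file signature is one way to keep a basic safety net once the Content-Type check is bypassed.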

Perhaps adding this recommendation (download the file first if you are sure the link is a PDF) to the error message would be an appropriate change to camelot?
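As a rough illustration of what such a message could look like, here is a hypothetical check. The function name `ensure_pdf_content_type` and the wording are assumptions made for this sketch; this is not camelot's current code.

```python
def ensure_pdf_content_type(content_type, url):
    """Raise a more descriptive error when the reported type is not a PDF.

    Hypothetical sketch of the suggested change, not camelot's implementation.
    """
    if content_type != "application/pdf":
        raise NotImplementedError(
            f"File format not supported: the server reported Content-Type "
            f"{content_type!r} for {url}. If you are sure the link points to "
            f"a PDF, download the file first and pass the local path to read_pdf()."
        )
```

The exception type stays the same as today, so existing code that catches `NotImplementedError` would keep working.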

DoomedJupiter avatar Mar 14 '25 19:03 DoomedJupiter

Closing this issue as the package functions as intended.

DoomedJupiter avatar Oct 29 '25 20:10 DoomedJupiter