NotImplementedError: File format not supported
For some PDF links I am getting this error: NotImplementedError: File format not supported
<ipython-input-11-0615a449639b> in <cell line: 1>()
----> 1 tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')
2 frames
/usr/local/lib/python3.10/dist-packages/camelot/utils.py in download_url(url)
87 content_type = obj.info().get_content_type()
88 if content_type != "application/pdf":
---> 89 raise NotImplementedError("File format not supported")
90 f.write(obj.read())
91 filepath = os.path.join(os.path.dirname(f.name), filename)
NotImplementedError: File format not supported
Steps to reproduce the bug
Run the code below to reproduce the error.
tables = camelot.read_pdf('https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf', pages='1', flavor='lattice')
Expected behavior
A list of tables was expected.
https://downloads.usda.library.cornell.edu/usda-esmis/files/cj82k728n/2v23wr658/v405t658m/wwcb2921.pdf
Environment
Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.8.2
also tried Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.9.0
and Linux-6.1.85+-x86_64-with-glibc2.35 Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] NumPy 1.26.4 OpenCV 4.10.0 Camelot 0.11.0
Hey!
As discussed in https://github.com/camelot-dev/camelot/issues/343, we are trying to build a maintained fork at pypdf_table_extraction.
Can you check with the latest code over there whether the issue still exists? If so, please open an issue there.
@MartinThoma @vinayak-mehta @bosd I am facing the same error as Kushal. Expected output: a list of tables. Standard output since this week: "Attribute Error: File Format not supported". Could you please let me know if a fix has been deployed on the forked branch? This was working a week ago, and my particular use case requires the lattice boundary that is provided exclusively in camelot-py[cv].
Could you please let me know if a fix has been deployed on the forked branch,
I assume the fork is ok. The tests are passing there.
Please test your use case with a fresh pip install of pypdf_table_extraction.
If that doesn't work, please install from source from the main branch.
If you still encounter an error, please open an issue on the new repo.
@MartinThoma @bosd @vinayak-mehta I tried installing the main branch of the fork as you suggested. Could you please add an example showing how camelot should be imported after installing pypdf-table-extraction from the GitHub main branch? I also opened an issue on the fork; please tag the active maintainers: https://github.com/py-pdf/pypdf_table_extraction/issues/63
I just looked at this. The behavior persists in the current release (v1.0.0) from this repo, but I believe the package is functioning as intended.
For the link first referenced in this issue, the content type reported for the HTTP response from the hosting server is text/plain. camelot only processes the returned object if its content type is exactly application/pdf; otherwise it raises the NotImplementedError observed above. The Content-Type header is part of the HTTP response from the server, so there is nothing camelot can do about this. If this issue appeared recently and documents from this source were previously processed by camelot without problems, it is likely that something changed on the server side.
Why was the content-type tag text/plain when it looks like a pdf? This can occur for several reasons: (1) that is the content type as listed in the HTTP message header from the server, (2) the content-type field is invalid so it defaults to text/plain, or (3) there is no content-type field in the HTTP response, so it defaults to text/plain. See here for details. In the case of the link above, there is no content-type header field in the HTTP response.
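The defaulting behavior described above can be reproduced offline with the standard library. This is only a minimal sketch: `effective_content_type` is a hypothetical helper, not part of camelot. It relies on `email.message.Message`, the base class of the header object that `urllib` responses return from `.info()`, so `get_content_type()` should behave the same way here as it does in the traceback.

```python
from email.message import Message


def effective_content_type(headers: dict) -> str:
    """Return the content type Python reports for a set of response headers.

    Hypothetical helper for illustration; it builds a bare Message and
    copies the given headers onto it.
    """
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    # get_content_type() falls back to "text/plain" when the header is
    # missing or is not of the form "maintype/subtype".
    return msg.get_content_type()


# Case 1: the server sends a proper header.
pdf_type = effective_content_type({"Content-Type": "application/pdf"})

# Case 2: the header is present but invalid (no "maintype/subtype").
invalid_type = effective_content_type({"Content-Type": "pdf"})

# Case 3: the server sends no Content-Type header at all.
missing_type = effective_content_type({})

print(pdf_type, invalid_type, missing_type)
```

In cases 2 and 3 the reported type is the fallback rather than anything the server actually claimed, which is why a strict comparison against application/pdf rejects the file.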
See this for more on possible content-type tags.
One option for users is to first download the file at the link and save it as a PDF, then process it with camelot as you would any local PDF file. This works if you are confident your links always point to PDFs, since it does not rely on the Content-Type header of the HTTP response the way camelot does.
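A rough sketch of that download-first workaround follows. The helper names (`looks_like_pdf`, `download_pdf`, `read_tables_from_url`) are hypothetical and not part of camelot, and the check for the `%PDF-` signature near the start of the file is an added safety assumption, not something camelot does. Only `camelot.read_pdf` with the arguments from the original report is taken from this thread.

```python
import tempfile
import urllib.request

PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Check for the PDF signature near the start of the data."""
    # Assumption: accept the signature anywhere in the first 1024 bytes,
    # since some files carry a few leading bytes before the header.
    return PDF_SIGNATURE in data[:1024]


def download_pdf(url: str) -> str:
    """Download url to a temporary .pdf file and return its path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} does not look like a PDF")
    # delete=False keeps the file on disk so camelot can open it by path.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        f.write(data)
        return f.name


def read_tables_from_url(url: str, pages: str = "1", flavor: str = "lattice"):
    """Download first, then hand the local file to camelot."""
    import camelot  # imported lazily so the helpers above work without it

    return camelot.read_pdf(download_pdf(url), pages=pages, flavor=flavor)
```

Calling `read_tables_from_url(...)` with the URL from the original report would then take the place of passing that URL directly to `camelot.read_pdf`. The temporary file is not cleaned up automatically, so remove it once the tables have been extracted.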
Perhaps providing this recommendation (users download first if they are sure the link is a pdf) in the error message to users would be an appropriate change to camelot?
Closing this issue as the package functions as intended.