pdfx
Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
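As a rough illustration of the four reference types named above (pdf, url, doi, arxiv), the sketch below classifies a single reference string with regular expressions. The patterns and the `classify_reference` name are assumptions made for this example; they are not taken from pdfx's source and will miss edge cases.

```python
import re

# Illustrative patterns only (assumptions, not pdfx's own regexes).
ARXIV_RE = re.compile(r"^arxiv:\s*\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)
DOI_RE = re.compile(r"^(doi:\s*)?10\.\d{4,9}/\S+$", re.IGNORECASE)
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)


def classify_reference(ref):
    """Return 'arxiv', 'doi', 'pdf', 'url', or None for an unrecognized string."""
    ref = ref.strip()
    if ARXIV_RE.match(ref):
        return "arxiv"
    if DOI_RE.match(ref):
        return "doi"
    if URL_RE.match(ref):
        # Drop any query string or fragment before checking the extension.
        path = re.split(r"[?#]", ref, maxsplit=1)[0]
        return "pdf" if path.lower().endswith(".pdf") else "url"
    return None


for sample in ["https://arxiv.org/pdf/1911.02782.pdf", "arXiv:1911.02782", "doi:10.1000/182"]:
    print(sample, "->", classify_reference(sample))
```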
Whenever I used `references = pdf.get_references_as_dict(sort=True)`, it would fail, saying:
```
  File "C:\Users\user\Scripts\PDFx\test.py", line 9, in <module>
    references = pdf.get_references_as_dict(sort=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\user\scripts\pdfx\pdfx\pdfx\__init__.py", line 168, in get_references_as_dict
    return self.reader.get_references_as_dict(reftype=reftype, sort=sort)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^...
```
Hi, pdfx is very helpful for us to analyze a few things. Thanks for creating pdfx. But we have a small problem. When a PDF file contains a lot of text, pdfx...
Improved the `extract_links` function to include hyperlinks spanning two or more lines by replacing line breaks in the text (issue #40)
Links that spill over onto a second line are cut off when being recognized and thus reported as dead.
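The wrapped-link problem can be sketched in a few lines. This is a simplified heuristic written for illustration, not the code from the actual fix: it assumes a URL only wraps right after a separator character such as `/`, `-`, `_`, `?`, `&` or `=`, and the names `join_wrapped_urls` and `find_urls` are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"]+")
# Assumed heuristic: a line break directly after a URL separator character,
# inside a token that starts with http(s)://, means the URL continues on the next line.
WRAPPED_RE = re.compile(r"(https?://\S*[/\-_?&=])\n(?=\S)")


def join_wrapped_urls(text):
    """Remove line breaks that split a URL, repeating until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = WRAPPED_RE.sub(r"\1", text)
    return text


def find_urls(text):
    """Return all URLs in text after rejoining wrapped ones."""
    return URL_RE.findall(join_wrapped_urls(text))


page = "See https://example.com/docs/\nguide.html for details.\nNext line."
print(URL_RE.findall(page))  # naive search stops at the line break
print(find_urls(page))       # wrapped URL is rejoined first
```

The loop handles links that wrap more than once, since each pass can only rejoin one break per URL. A break after any other character is left alone, so ordinary sentences that follow a link are not glued onto it.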
Could the original PDF file be modified so that the original links point to the locally downloaded files? A second, more interesting option would be...
closes #48
Hi, I use `pdfx -v path_to_pdf_file` to gather URLs from a PDF. This is great on its own. I would love to see pdfx expand to allow for URL extraction...
arXiv documents don't have title, author, and similar metadata.
```
➜ pdfx https://arxiv.org/pdf/1911.02782.pdf
Document infos:
- CreationDate = D:20200708010812Z
- Creator = LaTeX with hyperref package
- ModDate = D:20200708010812Z
...
```
I ran into some problems where a file was hanging because the urlopen command never timed out, so I added an option for users to specify that they want requests...
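For context, `urllib.request.urlopen` blocks indefinitely by default unless a `timeout` is passed or a global socket default is set. The sketch below shows one way a caller might bound the wait. It is not the option added in this change; `fetch_with_timeout` and its `opener` parameter are hypothetical names used only for illustration.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch_with_timeout(url, timeout=10.0, opener=urlopen):
    """Return the response body as bytes, or None if the request fails or times out.

    `opener` is a hypothetical hook so the function can be exercised without
    network access; by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (URLError, socket.timeout):
        # Connection errors arrive as URLError; a stalled read raises socket.timeout.
        return None
```

Returning `None` keeps a batch of downloads moving when a single server hangs; a caller that needs to distinguish timeouts from other failures would re-raise or log the exception instead.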