pdftitle
Couldn't extract title from a PDF with first page image
❯ pdftitle -p .\Downloads\test.pdf
Traceback (most recent call last):
File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
title = get_title_from_file(args.pdf)
File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
return get_title_from_io(raw_file)
File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
dev.recover_last_paragraph()
File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file
# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf
C++/CLI in Action
# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -
C++/CLI in Action
Here is the file: test.pdf
You have to use the --page-number argument. pdftitle does not scan the whole file; it checks only a single page (the first page by default).
$ pdftitle -p test.pdf --page-number 2
C++/CLI in Action
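When the page holding the title is not known in advance, one workaround is to call pdftitle once per page and keep the first result. The following is only a minimal sketch: it assumes `pdftitle` is on the PATH and that a failed extraction shows up as a non-zero exit code or empty output (the exit code is an assumption; only the traceback is shown above). The helper names `run_pdftitle` and `first_title` are hypothetical.

```python
import subprocess
from typing import Callable, Optional


def run_pdftitle(pdf_path: str, page: int) -> Optional[str]:
    """Run pdftitle on a single page; return the title, or None on failure."""
    result = subprocess.run(
        ["pdftitle", "-p", pdf_path, "--page-number", str(page)],
        capture_output=True,
        text=True,
    )
    title = result.stdout.strip()
    # Assumption: failure means a non-zero exit code or nothing on stdout.
    if result.returncode != 0 or not title:
        return None
    return title


def first_title(
    pdf_path: str,
    max_pages: int = 3,
    runner: Callable[[str, int], Optional[str]] = run_pdftitle,
) -> Optional[str]:
    """Try pages 1..max_pages in order and return the first title found."""
    for page in range(1, max_pages + 1):
        title = runner(pdf_path, page)
        if title:
            return title
    return None
```

The `runner` parameter exists only so the loop can be exercised without pdftitle installed; in normal use `first_title("test.pdf")` would shell out for each page.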
@metebalci Since it can't be known beforehand which PDFs have their title on the first page, don't you think a better option would be to specify the last page that is checked? By default --last-page-number would be 1, so only the first page would be checked. But I could set --last-page-number to something like 2 or 3, and the title would then be detected anywhere in the first 3 pages.
BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think it is pretty difficult to do with traditional methods (I mean without something smarter, e.g. from gestalt theory). The main purpose of the tool is to extract titles of (peer-reviewed) articles, which have no cover page and usually have a simple layout. On the other hand, I am not 100% sure, but what you suggest might not be difficult to implement and might have some use. So I am reopening the issue and will look at it the next time I work on the implementation. The changes could be:
- deprecate but do not remove --page-number, defaults to 1
- introduce --first-page-number, defaults to --page-number
- introduce --last-page-number (inclusive), defaults to --first-page-number. If --last-page-number differs from --first-page-number and the document has fewer pages than that, I guess it makes sense to stop silently at the end of the document.
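The defaulting rules in the list above could be sketched with argparse roughly as follows. This is not pdftitle's actual implementation: the three page options follow the proposal, while `build_parser` and `pages_to_check` are hypothetical helpers, and the page count is passed in rather than read from a PDF.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdftitle-sketch")
    parser.add_argument("-p", dest="pdf", required=True)
    # Deprecated in the proposal, but kept for backward compatibility.
    parser.add_argument("--page-number", type=int, default=1)
    # None means "not given", so the fallback chain can be resolved later.
    parser.add_argument("--first-page-number", type=int, default=None)
    parser.add_argument("--last-page-number", type=int, default=None)
    return parser


def pages_to_check(args: argparse.Namespace, page_count: int) -> List[int]:
    """Resolve the inclusive, 1-based page range described in the proposal."""
    first = args.first_page_number
    if first is None:
        first = args.page_number  # --first-page-number defaults to --page-number
    last = args.last_page_number
    if last is None:
        last = first  # --last-page-number defaults to --first-page-number
    # Stop silently at the end of the document if the range runs past it.
    last = min(last, page_count)
    return list(range(first, last + 1))
```

For example, with `--last-page-number 3` on a two-page file, this sketch checks pages 1 and 2 and then stops without raising an error.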