Couldn't extract title from a PDF with first page image

Open dufferzafar opened this issue 3 years ago • 3 comments

❯ pdftitle -p .\Downloads\test.pdf

Traceback (most recent call last):
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf

C++/CLI in Action

# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -

C++/CLI in Action

Here is the file: test.pdf

dufferzafar · Nov 18 '21 19:11

You have to use the --page-number argument. pdftitle does not scan the whole file; it checks a single page only (the first page by default).

$ pdftitle -p test.pdf --page-number 2
C++/CLI in Action

metebalci · Nov 21 '21 12:11

@metebalci Since it can't be known beforehand which PDFs have their title on the first page, don't you think a better option would be to specify the last page that is checked?

By default --last-page-number would be 1, so only the first page would be checked. But I could set --last-page-number to something like 2 or 3, so that the title is detected anywhere within the first 3 pages.
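Until something like that exists, the same behaviour can be approximated from the outside by calling pdftitle once per page. The following is only a rough sketch: `first_title` and its `last_page` and `run` parameters are names made up for illustration, and it assumes that a failed extraction shows up as a non-zero exit code or empty standard output, which I have not verified.

```python
import subprocess


def first_title(pdf_path, last_page=3, run=subprocess.run):
    """Try pages 1..last_page in order and return the first title found.

    Hypothetical helper, not part of pdftitle. It only relies on the
    -p and --page-number arguments shown earlier in this thread.
    """
    for page in range(1, last_page + 1):
        result = run(
            ["pdftitle", "-p", pdf_path, "--page-number", str(page)],
            capture_output=True,
            text=True,
        )
        title = result.stdout.strip()
        # Assumption: a failure gives a non-zero exit code or no output.
        if result.returncode == 0 and title:
            return title
    # No page in the range produced a title.
    return None
```

With the test.pdf from above, `first_title("test.pdf")` would be expected to stop at page 2, since that is where --page-number 2 finds the title. The `run` parameter is there only so the subprocess call can be replaced in tests.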

BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles

dufferzafar · Nov 21 '21 13:11

For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think that is pretty difficult to do with traditional methods (I mean without something smarter, e.g. ideas from gestalt theory). The main purpose of the tool is to extract titles of (peer-reviewed) articles, which do not have a cover page and usually have a simple layout. On the other hand, I am not 100% sure, but what you suggest might not be difficult to implement and it might have some use. So I am reopening the issue and will look into this the next time I work on the implementation. The changes could be:

  • deprecate but do not remove --page-number, defaults to 1
  • introduce --first-page-number, defaults to --page-number
  • introduce --last-page-number (inclusive), defaults to --first-page-number. If --last-page-number is larger than the actual number of pages, I guess it makes sense to stop silently at the end of the document.
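
To make the defaulting rules in the list above concrete, here is a minimal sketch of how the three options could resolve into a page range. It is not the pdftitle implementation: `build_parser` and `resolve_page_range` are hypothetical names, and --first-page-number and --last-page-number are only the proposed options, they do not exist yet.

```python
import argparse


def build_parser():
    """Hypothetical parser containing only the three page options."""
    parser = argparse.ArgumentParser(prog="pdftitle-sketch")
    # Deprecated but kept, defaults to 1.
    parser.add_argument("--page-number", type=int, default=1)
    # None means "not given", so the fallbacks below can apply.
    parser.add_argument("--first-page-number", type=int, default=None)
    parser.add_argument("--last-page-number", type=int, default=None)
    return parser


def resolve_page_range(args, page_count):
    """Return the 1-based, inclusive pages to check as a range."""
    first = args.first_page_number
    if first is None:
        first = args.page_number  # defaults to --page-number
    last = args.last_page_number
    if last is None:
        last = first  # defaults to --first-page-number
    if first < 1 or last < first:
        raise ValueError("invalid page range")
    # Stop silently at the end of the document.
    last = min(last, page_count)
    return range(first, last + 1)
```

For example, parsing `--last-page-number 3` for a two-page document gives pages 1 and 2, and passing no options gives page 1 only, which matches the current behaviour.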

metebalci · Nov 21 '21 16:11