
Missing pdf title

Open Mojster opened this issue 7 years ago • 5 comments

Hello,

I'm using a web crawler index. When it crawls a PDF document, it normally extracts a title. For some documents, it did not extract a title. So if I go to the render template and search for such a document, I cannot follow the link, because the link is title-based.

Could you please advise me how to fix this? Is it a bug or normal behaviour? If you need some example documents, I could provide you with some URLs.

Thanks for the great work and for your previous support.

Mojster avatar May 03 '17 08:05 Mojster

Definitely interested in an example.

Currently we are using the PDFBox library to extract this information. We may update the library (if required) or open an issue.

emmanuel-keller avatar May 08 '17 21:05 emmanuel-keller

Example image with search results (image: pdf_example).

Link to working PDF from image:
http://home.izum.si/izum/e-prirocniki/5_COBISS3_Izposoja/Cel_5_COBISS3_Izposoja.pdf

Links to PDF with no title:
http://home.izum.si/cobiss/oz/HTML/OZ_2012_4_final/files/assets/common/downloads/publication.pdf
http://home.izum.si/cobiss/OZ/HTML/OZ_2012_4_final/files/assets/common/downloads/publication.pdf

Hope this helps you further.

It could also be a problem with the PDF itself. I'll continue investigating that part.

Mojster avatar May 10 '17 08:05 Mojster

I found out that if I open the file in Acrobat Reader and go to File->Properties, there's a title field. If it's empty, then PDFBox naturally can't extract it.

I'm closing the issue because this is a mistake on the part of the PDF issuer.

Mojster avatar May 10 '17 10:05 Mojster

I'd like to reopen this. If there is no title in the PDF file there should be a fallback. E.g. use the URL as title. Otherwise the user can't click the result.
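
A minimal sketch of what such a fallback could look like, in plain Java. The class `TitleFallback` and the method `displayTitle` are hypothetical names used for illustration only; they are not part of OpenSearchServer or PDFBox, and where such a check would actually belong in the indexing or rendering code is an open question.

```java
// Hypothetical helper for illustration; not an OpenSearchServer or PDFBox API.
final class TitleFallback {

    // Returns the extracted title when it contains visible text.
    // Otherwise returns the document URL, so the search result
    // still has something clickable to display.
    // Assumption: url is the non-null address the crawler fetched.
    static String displayTitle(String extractedTitle, String url) {
        if (extractedTitle == null || extractedTitle.trim().isEmpty()) {
            return url;
        }
        return extractedTitle.trim();
    }

    public static void main(String[] args) {
        String url = "http://home.izum.si/cobiss/oz/HTML/OZ_2012_4_final/"
                + "files/assets/common/downloads/publication.pdf";
        // Title missing or blank: the URL is used instead.
        System.out.println(displayTitle(null, url));
        System.out.println(displayTitle("   ", url));
        // Title present: it is used as-is (trimmed).
        System.out.println(displayTitle(" Some title ", url));
    }
}
```

Treating a whitespace-only title like a missing one is a choice made here, since an empty title field in the PDF properties would otherwise still produce an unclickable result.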

Marx1st avatar Jan 22 '18 22:01 Marx1st

I agree with you. So I've reopened it.

Mojster avatar Feb 16 '18 09:02 Mojster