opensearchserver
Missing pdf title
Hello,
I'm using a web crawler index. When it crawls a PDF document, it normally extracts a title. For some documents, however, it does not extract one. So if I go to the render template and search for such a document, I cannot follow the link, because the link is title-based.
Could you please advise me how to fix this? Is it a bug or normal behaviour? If you need some example documents, I can provide some URLs.
Thanks for the great work and for your previous support.
Definitely interested in an example.
Currently we are using the PDFBox library to extract this information. We may update the library (if required) or open an issue.
Example image with search results.
Link to the working PDF from the image:
http://home.izum.si/izum/e-prirocniki/5_COBISS3_Izposoja/Cel_5_COBISS3_Izposoja.pdf
Links to PDFs with no title:
http://home.izum.si/cobiss/oz/HTML/OZ_2012_4_final/files/assets/common/downloads/publication.pdf
http://home.izum.si/cobiss/OZ/HTML/OZ_2012_4_final/files/assets/common/downloads/publication.pdf
Hope this helps you further.
It could also be a problem with the PDF itself. I'll continue investigating that side.
I found out that if I open the file in Acrobat Reader and go to File->Properties, there's a title field. If that field is empty, then PDFBox has nothing to extract.
I'm closing the issue because this is a mistake on the part of the PDF issuer.
I'd like to reopen this. If there is no title in the PDF file, there should be a fallback, e.g. use the URL as the title. Otherwise the user can't click the result.
I agree with you. So I've reopened it.
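For illustration, the fallback suggested above could look roughly like the sketch below. This is only an assumption about how it might be done, not the actual OpenSearchServer code: the class name `TitleFallback` and the method `resolveTitle` are hypothetical, and the extracted title is assumed to arrive as a plain string that may be null or blank when the PDF's title property is empty.

```java
// Hypothetical helper for illustration only; not part of OpenSearchServer or PDFBox.
final class TitleFallback {

    private TitleFallback() {
    }

    /**
     * Returns the extracted title, or the document URL when the title is
     * missing (null) or contains only whitespace.
     */
    static String resolveTitle(String extractedTitle, String url) {
        if (extractedTitle == null || extractedTitle.trim().isEmpty()) {
            // No usable title in the PDF metadata: fall back to the URL
            // so the search result still has something clickable.
            return url;
        }
        return extractedTitle.trim();
    }

    public static void main(String[] args) {
        String url = "http://home.izum.si/cobiss/oz/HTML/OZ_2012_4_final/files/assets/common/downloads/publication.pdf";
        // "Some title" is a made-up value used only for this demonstration.
        System.out.println(resolveTitle(null, url));
        System.out.println(resolveTitle("   ", url));
        System.out.println(resolveTitle("Some title", url));
    }
}
```

Treating a whitespace-only title the same as a missing one is a design choice made here, since either would leave the result without a visible link text.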