
Tika-Python does not parse the metadata from PDF

Open Apurv3377 opened this issue 2 years ago • 1 comments

Sorry for such a general issue, but I have been trying hard to extract metadata (author, title, abstract) from a PDF using the Tika-Python client. Unfortunately, it is not able to extract any of this data under the metadata key. Is there anything missing?

Input PDF link

Here is my code

import tika
from tika import parser
from dicttoxml import dicttoxml
from xml.dom.minidom import parseString

tika.initVM()
# Parse the PDF; the result is a dict that includes 'metadata' and 'content' keys.
parsed = parser.from_file('247.tar_1710.11035.gz_MTforGSW_black.pdf')
# Convert the metadata dict to XML and pretty-print it.
xml = dicttoxml(parsed['metadata'], custom_root='PDF', attr_type=False)
dom = parseString(xml)
print(dom.toprettyxml())

Metadata Output

<?xml version="1.0" ?>
<PDF>
	<Author/>
	<Content-Type>application/pdf</Content-Type>
	<Creation-Date>2020-05-30T02:21:14Z</Creation-Date>
	<Keywords/>
	<Last-Modified>2020-05-30T02:21:14Z</Last-Modified>
	<Last-Save-Date>2020-05-30T02:21:14Z</Last-Save-Date>
	<PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</PTEX.Fullbanner>
	<X-Parsed-By>
		<item>org.apache.tika.parser.DefaultParser</item>
		<item>org.apache.tika.parser.pdf.PDFParser</item>
	</X-Parsed-By>
	<key name="X-TIKA:content_handler">ToTextContentHandler</key>
	<key name="X-TIKA:embedded_depth">0</key>
	<key name="X-TIKA:parse_time_millis">53</key>
	<key name="access_permission:assemble_document">true</key>
	<key name="access_permission:can_modify">true</key>
	<key name="access_permission:can_print">true</key>
	<key name="access_permission:can_print_degraded">true</key>
	<key name="access_permission:extract_content">true</key>
	<key name="access_permission:extract_for_accessibility">true</key>
	<key name="access_permission:fill_in_form">true</key>
	<key name="access_permission:modify_annotations">true</key>
	<key name="cp:subject"/>
	<created>2020-05-30T02:21:14Z</created>
	<creator/>
	<date>2020-05-30T02:21:14Z</date>
	<key name="dc:creator"/>
	<key name="dc:format">application/pdf; version=1.5</key>
	<key name="dc:subject"/>
	<key name="dc:title"/>
	<key name="dcterms:created">2020-05-30T02:21:14Z</key>
	<key name="dcterms:modified">2020-05-30T02:21:14Z</key>
	<key name="meta:author"/>
	<key name="meta:creation-date">2020-05-30T02:21:14Z</key>
	<key name="meta:keyword"/>
	<key name="meta:save-date">2020-05-30T02:21:14Z</key>
	<modified>2020-05-30T02:21:14Z</modified>
	<key name="pdf:PDFVersion">1.5</key>
	<key name="pdf:charsPerPage">
		<item>4556</item>
		<item>4652</item>
		<item>4515</item>
		<item>5149</item>
		<item>4856</item>
		<item>4552</item>
		<item>4191</item>
		<item>3190</item>
	</key>
	<key name="pdf:docinfo:created">2020-05-30T02:21:14Z</key>
	<key name="pdf:docinfo:creator"/>
	<key name="pdf:docinfo:creator_tool">LaTeX with hyperref</key>
	<key name="pdf:docinfo:custom:PTEX.Fullbanner">This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</key>
	<key name="pdf:docinfo:keywords"/>
	<key name="pdf:docinfo:modified">2020-05-30T02:21:14Z</key>
	<key name="pdf:docinfo:producer">pdfTeX-1.40.19</key>
	<key name="pdf:docinfo:subject"/>
	<key name="pdf:docinfo:title"/>
	<key name="pdf:docinfo:trapped">False</key>
	<key name="pdf:encrypted">false</key>
	<key name="pdf:hasMarkedContent">false</key>
	<key name="pdf:hasXFA">false</key>
	<key name="pdf:hasXMP">false</key>
	<key name="pdf:unmappedUnicodeCharsPerPage">
		<item>0</item>
		<item>0</item>
		<item>0</item>
		<item>6</item>
		<item>0</item>
		<item>0</item>
		<item>0</item>
		<item>0</item>
	</key>
	<producer>pdfTeX-1.40.19</producer>
	<resourceName>b'247.tar_1710.11035.gz_MTforGSW_black.pdf'</resourceName>
	<subject/>
	<title/>
	<trapped>False</trapped>
	<key name="xmp:CreatorTool">LaTeX with hyperref</key>
	<key name="xmpTPg:NPages">8</key>
</PDF>

Apurv3377 Jul 27 '21 13:07

I just started using Tika and have run into the same issue. Have you found a way to solve this? Thanks.

A-acuto Apr 27 '22 11:04

Are you sure that the PDF actually has the author attribute set? It's possible that the tool that created the PDF didn't set it, or that the value was, e.g., missing from the environment variables and so never got passed through.
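
One way to check this is to look at which document-info fields actually carry a value. The following is a minimal sketch, not part of tika-python: the helper name `split_set_and_empty` and the default field list are hypothetical, and the sample values are copied from the output posted above, with empty strings standing in for the empty XML elements (an assumption about how Tika returned them).

```python
# Hypothetical helper: the name and the default field list are illustrative,
# not part of tika-python.
DOCINFO_FIELDS = (
    "pdf:docinfo:title",
    "pdf:docinfo:creator",
    "pdf:docinfo:subject",
    "pdf:docinfo:keywords",
)


def split_set_and_empty(metadata, fields=DOCINFO_FIELDS):
    """Return (set_fields, empty_fields) for the given metadata dict."""
    set_fields, empty_fields = {}, []
    for name in fields:
        value = metadata.get(name)
        # Treat a missing key, None, or a blank string as "not set".
        if value is None or (isinstance(value, str) and not value.strip()):
            empty_fields.append(name)
        else:
            set_fields[name] = value
    return set_fields, empty_fields


# Values copied from the output posted above; empty strings stand in for
# the empty XML elements (assumption: Tika returned them as blank values).
sample = {
    "pdf:docinfo:title": "",
    "pdf:docinfo:creator": "",
    "pdf:docinfo:subject": "",
    "pdf:docinfo:keywords": "",
    "pdf:docinfo:creator_tool": "LaTeX with hyperref",
    "pdf:docinfo:producer": "pdfTeX-1.40.19",
    "pdf:hasXMP": "false",
}

set_fields, empty_fields = split_set_and_empty(sample)
print("set:", set_fields)
print("empty:", empty_fields)
```

With a real file, the same helper could be called on `parser.from_file(path)['metadata']`. If every field lands in the empty list, as it does for the sample, the values were most likely never written into the PDF. The posted output also reports `pdf:hasXMP` as false, which suggests there is no XMP packet to fall back on either.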

chrismattmann Dec 31 '22 21:12

Not enough detail to act on this. Please comment if you have more detail. Thanks for raising this.

chrismattmann Dec 31 '22 21:12