tika-python
Tika-Python does not parse the metadata from PDF
Sorry for such a general issue. I have been trying to extract metadata (author, title, abstract) from a PDF using the tika-python client, but it does not return any values for these fields in the metadata. Is there anything I am missing?
Here is my code:
import tika
from tika import parser
from dicttoxml import dicttoxml
from xml.dom.minidom import parseString

tika.initVM()
# Parse the PDF; the returned dict includes 'metadata' and 'content' keys
parsed = parser.from_file('247.tar_1710.11035.gz_MTforGSW_black.pdf')
# Convert the metadata dict to XML and pretty-print it
xml = dicttoxml(parsed['metadata'], custom_root='PDF', attr_type=False)
dom = parseString(xml)
print(dom.toprettyxml())
Metadata Output
<?xml version="1.0" ?>
<PDF>
<Author/>
<Content-Type>application/pdf</Content-Type>
<Creation-Date>2020-05-30T02:21:14Z</Creation-Date>
<Keywords/>
<Last-Modified>2020-05-30T02:21:14Z</Last-Modified>
<Last-Save-Date>2020-05-30T02:21:14Z</Last-Save-Date>
<PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</PTEX.Fullbanner>
<X-Parsed-By>
<item>org.apache.tika.parser.DefaultParser</item>
<item>org.apache.tika.parser.pdf.PDFParser</item>
</X-Parsed-By>
<key name="X-TIKA:content_handler">ToTextContentHandler</key>
<key name="X-TIKA:embedded_depth">0</key>
<key name="X-TIKA:parse_time_millis">53</key>
<key name="access_permission:assemble_document">true</key>
<key name="access_permission:can_modify">true</key>
<key name="access_permission:can_print">true</key>
<key name="access_permission:can_print_degraded">true</key>
<key name="access_permission:extract_content">true</key>
<key name="access_permission:extract_for_accessibility">true</key>
<key name="access_permission:fill_in_form">true</key>
<key name="access_permission:modify_annotations">true</key>
<key name="cp:subject"/>
<created>2020-05-30T02:21:14Z</created>
<creator/>
<date>2020-05-30T02:21:14Z</date>
<key name="dc:creator"/>
<key name="dc:format">application/pdf; version=1.5</key>
<key name="dc:subject"/>
<key name="dc:title"/>
<key name="dcterms:created">2020-05-30T02:21:14Z</key>
<key name="dcterms:modified">2020-05-30T02:21:14Z</key>
<key name="meta:author"/>
<key name="meta:creation-date">2020-05-30T02:21:14Z</key>
<key name="meta:keyword"/>
<key name="meta:save-date">2020-05-30T02:21:14Z</key>
<modified>2020-05-30T02:21:14Z</modified>
<key name="pdf:PDFVersion">1.5</key>
<key name="pdf:charsPerPage">
<item>4556</item>
<item>4652</item>
<item>4515</item>
<item>5149</item>
<item>4856</item>
<item>4552</item>
<item>4191</item>
<item>3190</item>
</key>
<key name="pdf:docinfo:created">2020-05-30T02:21:14Z</key>
<key name="pdf:docinfo:creator"/>
<key name="pdf:docinfo:creator_tool">LaTeX with hyperref</key>
<key name="pdf:docinfo:custom:PTEX.Fullbanner">This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/W32TeX) kpathsea version 6.3.0</key>
<key name="pdf:docinfo:keywords"/>
<key name="pdf:docinfo:modified">2020-05-30T02:21:14Z</key>
<key name="pdf:docinfo:producer">pdfTeX-1.40.19</key>
<key name="pdf:docinfo:subject"/>
<key name="pdf:docinfo:title"/>
<key name="pdf:docinfo:trapped">False</key>
<key name="pdf:encrypted">false</key>
<key name="pdf:hasMarkedContent">false</key>
<key name="pdf:hasXFA">false</key>
<key name="pdf:hasXMP">false</key>
<key name="pdf:unmappedUnicodeCharsPerPage">
<item>0</item>
<item>0</item>
<item>0</item>
<item>6</item>
<item>0</item>
<item>0</item>
<item>0</item>
<item>0</item>
</key>
<producer>pdfTeX-1.40.19</producer>
<resourceName>b'247.tar_1710.11035.gz_MTforGSW_black.pdf'</resourceName>
<subject/>
<title/>
<trapped>False</trapped>
<key name="xmp:CreatorTool">LaTeX with hyperref</key>
<key name="xmpTPg:NPages">8</key>
</PDF>
I just started using Tika and I've stumbled across the same issue. Have you found a way to solve this? Thanks.
Are you sure that the PDF actually has the author attribute set? It's possible that the tool that created the PDF didn't set it, or that the value was missing from, e.g., the environment variables and so never got passed through.
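One quick way to check this is to list which of the author and title keys come back blank. Below is a minimal sketch, not part of tika-python: the helper name blank_fields is hypothetical, the key names are copied from the metadata output above, and the sample dict assumes that the empty XML tags correspond to empty strings in parsed['metadata'].

```python
# Keys copied from the metadata output in this thread.
AUTHOR_TITLE_KEYS = [
    "Author", "dc:creator", "meta:author", "pdf:docinfo:creator",
    "title", "dc:title", "pdf:docinfo:title",
]


def blank_fields(metadata, keys=AUTHOR_TITLE_KEYS):
    """Return the keys that are missing from metadata or hold only blank values.

    Hypothetical helper for illustration; it only inspects a plain dict.
    """
    blank = []
    for key in keys:
        value = metadata.get(key)
        # A key with several values may come back as a list, so normalise first.
        values = value if isinstance(value, list) else [value]
        if all(v is None or not str(v).strip() for v in values):
            blank.append(key)
    return blank


# Sample shaped like the output above (assumption: empty tags were empty strings).
sample = {
    "Author": "",
    "dc:title": "",
    "pdf:docinfo:creator_tool": "LaTeX with hyperref",
    "pdf:hasXMP": "false",
}
print(blank_fields(sample))
```

With a real file you would pass parser.from_file(path)['metadata'] instead of the sample dict. If every key is reported as blank, as the output above suggests, the values are probably not stored in the PDF at all, so there is nothing for Tika to return.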
There is not enough detail to act on this. Please comment again if you have more detail. Thanks for raising this.