WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont=

Open jackyetz opened this issue 5 years ago • 3 comments

When extracting text from a PDF (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warnings and the extraction failed.

My code is:

import os
import sys
import importlib
importlib.reload(sys)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

def parse(path, target):
    if os.path.exists(target):
        os.remove(target)
    fp = open(path, 'rb')
    praser = PDFParser(fp)
    doc = PDFDocument()
    praser.set_document(doc)
    doc.set_parser(praser)

    doc.initialize()

    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        rsrcmgr = PDFResourceManager()
        laparams = LAParams(all_texts = True)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        for page in doc.get_pages():  # doc.get_pages() returns the list of pages
            interpreter.process_page(page)
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    with open(target, 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        # print(results)
                        f.write(results + '\n')

if __name__ == '__main__':
    path = r'./pdf/aa3061-05.pdf'
    parse(path, path.replace('.pdf', '.txt'))

The warnings:

......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
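
Note: the `WARNING:pdfminer.converter:` prefix on the messages above matches the default `LEVEL:name:message` format of Python's `logging` module, which suggests they are emitted through a logger named `pdfminer.converter`. The following is a minimal sketch, assuming that logger name and assuming the goal is only to quiet the output; `quiet_pdfminer_converter` is a hypothetical helper, not part of pdfminer3k, and it does not change which characters are extracted.

```python
import logging

# Assumption: logging is set up with the default "LEVEL:name:message" format,
# as the warning lines in this report suggest.
logging.basicConfig(level=logging.WARNING)


def quiet_pdfminer_converter(level=logging.ERROR):
    """Hypothetical helper: raise the threshold of the 'pdfminer.converter' logger.

    The logger name is inferred from the warning prefix, not from the
    library's documentation.
    """
    logger = logging.getLogger("pdfminer.converter")
    logger.setLevel(level)
    return logger


logger = quiet_pdfminer_converter()

# A WARNING-level record shaped like the ones in this report is now dropped,
# because the logger only lets ERROR and above through.
logger.warning("undefined: %s, %s", "<PDFType1Font: basefont='BIBNJI+txsy'>", 5)
```

Calling the helper before `parse(...)` should hide these lines if the assumption about the logger name holds. It only affects logging, so any problem with the extracted text itself would remain.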

jackyetz, Mar 03 '19 17:03

I'm getting the same problem. I'll let you know if I fix it.

paulfwb, May 03 '20 14:05

Could you share your solution, please! I have the same problem.

rocket2016, Jan 11 '21 17:01