
font size of each character identical for each text line

Open femifrak opened this issue 1 year ago • 0 comments

pdfminer.six is really helpful! Thanks a lot!

However, I am struggling to determine the font size of each character. I use the function below with the given LAParams to force each complete line into a single bounding box (BB). A single font size is then determined for each line. (As far as I understand, this is because the font size is just the vertical size of the BB.) My PDF is a scan with an OCR text layer. Probably because the lines are slightly bent, the font size of ordinary text is often erroneously reported as larger than that of headlines.
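To put a number on the bent-line suspicion, here is a minimal geometric sketch. It assumes the reported size follows the height of an axis-aligned box around a slightly skewed line, which is a guess about the OCR layer and not something confirmed for pdfminer.six. The function name and the sample dimensions are hypothetical.

```python
import math


def skewed_box_height(width, height, skew_degrees):
    """Height of the axis-aligned box around a width x height rectangle
    that has been rotated by skew_degrees (valid for 0 to 90 degrees).

    Hypothetical helper for illustration; not part of pdfminer.six.
    """
    theta = math.radians(abs(skew_degrees))
    # The long side contributes width * sin(theta), the short side height * cos(theta).
    return width * math.sin(theta) + height * math.cos(theta)


# A 400-unit-wide body line of height 10, skewed by one degree,
# versus a short 80-unit-wide headline of height 14 with the same skew.
body = skewed_box_height(400, 10, 1.0)
headline = skewed_box_height(80, 14, 1.0)
print(round(body, 2), round(headline, 2))
```

Under this assumption, a long body line gains more apparent height from a small skew than a short headline does, which would match the inverted sizes described above.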

To avoid this, I tried to force a separate BB for each character using another set of LAParams (commented out below). But this still yields the same font size for every character in a text line. (Different text lines do get different font sizes, though.) Isn't this strange? Or am I misunderstanding something? How can I extract more realistic character sizes to distinguish normal text from headlines?

BTW: I do not want to set absolute thresholds, so that the approach stays universal. Instead, I determine the most frequent font size and treat outliers as headlines.
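The most-frequent-size idea can be sketched as follows, working on the flat `fontsizes` list that `extract_pageinfo` returns. The function name and the `precision` and `rel_tol` parameters are illustrative assumptions, not pdfminer.six settings. `rel_tol` is a relative margin around the dominant size, so no absolute point size is hard-coded.

```python
from collections import Counter


def split_by_dominant_size(fontsizes, precision=1, rel_tol=0.15):
    """Return (body_size, headline_sizes) for a flat list of character sizes.

    precision and rel_tol are hypothetical tuning knobs for this sketch.
    """
    if not fontsizes:
        return None, []
    # Round so that near-identical sizes fall into the same bucket.
    rounded = [round(size, precision) for size in fontsizes]
    # The most frequent bucket is taken as the body text size.
    body_size, _ = Counter(rounded).most_common(1)[0]
    # Sizes clearly above the body size are treated as headline candidates.
    threshold = body_size * (1 + rel_tol)
    headline_sizes = sorted({size for size in rounded if size > threshold})
    return body_size, headline_sizes


sizes = [10.0] * 5 + [10.04] * 3 + [14.2, 14.2, 18.0]
print(split_by_dominant_size(sizes))  # -> (10.0, [14.2, 18.0])
```

Because every character contributes one entry, long body lines dominate the count. This only helps if the reported sizes are trustworthy in the first place, which is exactly the open question here.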

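from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager


def extract_pageinfo(page):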
    coords = []
    texts = []
    fontsizes = []

    resource_manager = PDFResourceManager()

    # one bb per text line
    laparams = LAParams(line_margin=0, char_margin=10)

    # one bb per character
    # laparams = LAParams(line_margin=0, char_margin=0, line_overlap=0, word_margin=0)

    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)

    interpreter.process_page(page)
    layout = device.get_result()

    for obj in layout:
        if isinstance(obj, LTTextBox):
            coords.append(obj.bbox)
            texts.append(obj.get_text())
            for text_line in obj:
                for character in text_line:
                    if isinstance(character, LTChar):
                        fontsizes.append(character.size)

    return coords, texts, fontsizes

femifrak, Sep 13 '23 12:09