pdfminer.six
font size of each character identical for each text line
pdfminer.six is really helpful! Thanks a lot!
However, I struggle to determine the font size of each character. I use the function below with the given LAParams to force each complete line into a single bounding box (BB). For each line, a single font size is determined. (To my understanding, this is because the font size is simply the vertical size of the BB.) My PDF is a scan with an OCR text layer. Probably because the lines are slightly bent, the font size of ordinary text is often erroneously determined to be larger than that of headlines.
To avoid this, I tried to enforce a separate BB for each character using another set of LAParams (commented out below). But this still yields the same font size for every character of a text line. (Different text lines do yield different font sizes, though.) Isn't this strange? Or am I misunderstanding something? How can I extract more realistic character sizes to distinguish normal text from headlines?
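To make the observation above checkable, here is a minimal sketch that reports the spread of character sizes within each line. It assumes the sizes have already been collected per line (for example, one list of `character.size` values per text line); the helper name `size_spread_per_line` and the sample data are hypothetical and not part of pdfminer.six.

```python
def size_spread_per_line(sizes_per_line):
    """Return (smallest, largest, spread) of the character sizes of each non-empty line."""
    report = []
    for sizes in sizes_per_line:
        if not sizes:
            continue  # skip lines that contain no characters
        smallest, largest = min(sizes), max(sizes)
        report.append((smallest, largest, largest - smallest))
    return report


# Hypothetical sample: two lines, each with identical per-character sizes
sample = [[10.0, 10.0, 10.0], [14.0, 14.0]]
for smallest, largest, spread in size_spread_per_line(sample):
    print(smallest, largest, spread)
```

A spread of zero on every line would confirm that the sizes only differ between lines, not within them.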
BTW: To stay general, I do not want to hard-code absolute thresholds. Instead, I determine the most frequent font size (normal text) and treat the outliers as headlines.
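The idea of taking the most frequent font size as body text and the outliers as headlines could be sketched roughly as follows. This is only an illustration under assumptions: the function name `split_body_and_headlines`, the rounding `precision`, and the relative tolerance `rel_tolerance` are hypothetical choices, not something pdfminer.six provides.

```python
from collections import Counter


def split_body_and_headlines(fontsizes, precision=1, rel_tolerance=0.15):
    """Return (body_size, headline_sizes) from a flat list of character sizes.

    Sizes are rounded so that tiny numeric differences fall into the same bin.
    The most frequent rounded size is taken as body text; sizes larger than it
    by more than rel_tolerance (a relative, not absolute, margin) are treated
    as headline candidates. Both parameters are assumptions.
    """
    rounded = [round(size, precision) for size in fontsizes]
    if not rounded:
        return None, []
    body_size, _ = Counter(rounded).most_common(1)[0]
    threshold = body_size * (1 + rel_tolerance)
    headline_sizes = sorted({size for size in rounded if size > threshold})
    return body_size, headline_sizes


# Hypothetical sample: mostly 10 pt text with two larger sizes
sizes = [10.0] * 5 + [10.04] * 3 + [14.0, 18.2]
body, headlines = split_body_and_headlines(sizes)
print(body, headlines)
```

Because the margin is relative to the detected body size, no absolute size has to be supplied per document.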
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager


def extract_pageinfo(page):
    coords = []
    texts = []
    fontsizes = []
    resource_manager = PDFResourceManager()
    # one bb per text line
    laparams = LAParams(line_margin=0, char_margin=10)
    # one bb per character
    # laparams = LAParams(line_margin=0, char_margin=0, line_overlap=0, word_margin=0)
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    interpreter.process_page(page)
    layout = device.get_result()
    for obj in layout:
        if isinstance(obj, LTTextBox):
            coords.append(obj.bbox)
            texts.append(obj.get_text())
            for text_line in obj:
                for character in text_line:
                    # skip LTAnno objects (virtual spaces/newlines), which have no size
                    if isinstance(character, LTChar):
                        fontsizes.append(character.size)
    return coords, texts, fontsizes