
feat/ extract style or font for Text elements.

Open LunaticMaestro opened this issue 5 months ago • 7 comments

I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.

Is font-style extraction planned for the future?

LunaticMaestro avatar Mar 26 '24 06:03 LunaticMaestro

@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?

scanny avatar Mar 26 '24 17:03 scanny

Hi scanny, thanks for the reply. Unfortunately, the suggested metadata does not contain the requested content.

Find the screenshot attached.

I am using the PDF from the example docs: example-docs/layout-parser-paper.pdf

image

LunaticMaestro avatar Mar 27 '24 03:03 LunaticMaestro

Hi @LunaticMaestro yes, unfortunately it turns out that metadata is not supported for PDF, apologies for that.

It is supported for DOCX, however, if that's any help.

scanny avatar Mar 27 '24 22:03 scanny

I beg to differ. Here's an example snippet that reads a DOCX file and fails to detect the font styling.

Find the DOCX file attached for the purpose of reproducing: redacted.docx

image

LunaticMaestro avatar Mar 28 '24 04:03 LunaticMaestro

@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.

However, text that is made bold or italic directly, using the toolbar buttons, is properly detected.

I added the following paragraph to the document: "This is a paragraph that has some bold and some italic.", with the words "bold" and "italic" formatted with the toolbar buttons and it produces the following metadata:

{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
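
The two emphasis fields in the metadata above are parallel lists: the n-th entry of emphasized_text_tags describes the n-th entry of emphasized_text_contents. A minimal sketch of pairing them up follows. It works on a plain dict like the one shown; emphasized_spans is a hypothetical helper name, not part of unstructured, and the dict is trimmed to the relevant keys.

```python
def emphasized_spans(metadata: dict) -> list[tuple[str, str]]:
    """Pair each emphasized text fragment with its tag ('b' or 'i')."""
    # Either key may be missing when an element has no emphasized text.
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    return list(zip(contents, tags))


# Metadata copied from the output above, trimmed to the relevant keys.
metadata = {
    "emphasized_text_contents": ["bold", "italic"],
    "emphasized_text_tags": ["b", "i"],
    "filename": "redacted.docx",
}

for text, tag in emphasized_spans(metadata):
    print(f"{tag}: {text}")
```

An element without any directly formatted text would simply yield an empty list here.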

scanny avatar Mar 28 '24 05:03 scanny

Since unstructured reuses pdfminer, I would expect it to be able to use pdfminer's native implementation to get the character properties, for example: pdfminer character style.
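
A minimal sketch of what reading per-character font information with pdfminer.six could look like. This is not how unstructured does it, only an illustration: iter_char_styles and style_from_fontname are hypothetical helper names, and guessing bold or italic from the font name is a heuristic that will miss fonts that do not follow the usual naming conventions.

```python
import re


def style_from_fontname(fontname: str) -> dict:
    """Guess bold/italic from a PDF font name such as 'ABCDEF+Times-BoldItalic'.

    Heuristic only: it relies on common font naming conventions.
    """
    # Drop the six-letter subset prefix ("ABCDEF+") that embedded fonts often carry.
    base = re.sub(r"^[A-Z]{6}\+", "", fontname)
    lowered = base.lower()
    return {
        "font": base,
        "bold": "bold" in lowered,
        "italic": "italic" in lowered or "oblique" in lowered,
    }


def iter_char_styles(pdf_path: str):
    """Yield (character, font name, font size) for each character pdfminer lays out."""
    # Imported lazily so style_from_fontname works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, etc.
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTChar carries the font name and size; LTAnno (spaces, newlines) does not.
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, obj.size
```

Running iter_char_styles over example-docs/layout-parser-paper.pdf and passing each font name through style_from_fontname would give a rough per-character bold/italic signal that could then be grouped into emphasized spans.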

LunaticMaestro avatar Mar 28 '24 05:03 LunaticMaestro