unstructured feat/ extract style or font for Text elements.

Open LunaticMaestro opened this issue 5 months ago • 7 comments

I was trying out the tutorial. However, when partitioning the PDF provided in the tutorial, I did not see the font style of the text stored in the element's metadata.

Is font-style extraction planned for the future?

Mar 26 '24 06:03 LunaticMaestro

@LunaticMaestro font style is stored in .metadata.emphasized_text_contents and .metadata.emphasized_text_tags. Did you look there?
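
For anyone wanting to check those two fields quickly, here is a minimal sketch. The helper names `emphasis_per_element` and `load_elements` are hypothetical (not part of unstructured); the only library call assumed is `partition(filename=...)` from `unstructured.partition.auto`. Both fields may be absent or `None` on an element, so the helper reads them defensively.

```python
from typing import Iterable, List, Tuple


def emphasis_per_element(elements: Iterable) -> List[Tuple[str, list, list]]:
    """Return (text, emphasized contents, emphasized tags) for each element
    that carries emphasis metadata. Hypothetical helper, not an unstructured API."""
    rows = []
    for el in elements:
        meta = getattr(el, "metadata", None)
        # Either field can be missing or None when no emphasis was detected.
        contents = getattr(meta, "emphasized_text_contents", None)
        tags = getattr(meta, "emphasized_text_tags", None)
        if contents:
            rows.append((getattr(el, "text", ""), list(contents), list(tags or [])))
    return rows


def load_elements(path: str):
    """Partition a file with unstructured (imported lazily, so the helper
    above can be used without unstructured installed)."""
    from unstructured.partition.auto import partition

    return partition(filename=path)
```

Calling `emphasis_per_element(load_elements("some-file.docx"))` (the path is a placeholder) and getting an empty list back means no emphasis was recorded for that document.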

Mar 26 '24 17:03 scanny

Hi scanny, thanks for the reply. Unfortunately, the suggested metadata does not contain the requested content.

Find the screenshot attached.

I am using the PDF from the example docs: example-docs/layout-parser-paper.pdf

Mar 27 '24 03:03 LunaticMaestro

Hi @LunaticMaestro, yes, unfortunately it turns out that this metadata is not supported for PDF. Apologies for that.

It is supported for DOCX, however, if that's any help.

Mar 27 '24 22:03 scanny

I beg to differ. Here's an example snippet that reads a DOCX file and fails to pick up the font elements.

Find the DOCX file attached for the purpose of reproducing: redacted.docx

Mar 28 '24 04:03 LunaticMaestro

@LunaticMaestro the file you referenced has character styling set using a character style, which is unfortunately not yet supported.

However, text that is made bold or italic directly, using the toolbar buttons, is properly detected.

I added the following paragraph to the document: "This is a paragraph that has some bold and some italic.", with the words "bold" and "italic" formatted with the toolbar buttons and it produces the following metadata:

{
    'category_depth': 0,
    'emphasized_text_contents': ['bold', 'italic'],
    'emphasized_text_tags': ['b', 'i'],
    'last_modified': '2024-03-27T22:03:51',
    'languages': ['eng'],
    'parent_id': 'ede9865e755cdea84eb99e51cb277a0e',
    'file_directory': '/Users/scanny/Desktop',
    'filename': 'redacted.docx',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
}
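
As a small illustration of how the two lists in that output can be consumed together, here is a sketch that pairs each emphasized run with its tag. It assumes the lists are index-aligned, which the output above suggests but which is not stated in this thread; `pair_emphasis` is a hypothetical helper name.

```python
def pair_emphasis(metadata: dict) -> list:
    """Pair each emphasized text run with its tag.

    Assumes the two lists are parallel (same order, same length);
    zip() silently stops at the shorter list if they are not.
    """
    contents = metadata.get("emphasized_text_contents") or []
    tags = metadata.get("emphasized_text_tags") or []
    return list(zip(contents, tags))


# The relevant fields from the metadata shown above.
metadata = {
    "emphasized_text_contents": ["bold", "italic"],
    "emphasized_text_tags": ["b", "i"],
}
pairs = pair_emphasis(metadata)
print(pairs)  # [('bold', 'b'), ('italic', 'i')]
```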

Mar 28 '24 05:03 scanny

Since unstructured reuses pdfminer, I would expect it to use pdfminer's native implementation to get the character properties, for example: pdfminer character style.
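
A minimal sketch of what reading character properties with pdfminer.six directly could look like, outside of unstructured. It relies on `extract_pages` and the `LTChar` attributes `fontname` and `size`. The function names are hypothetical, and guessing bold or italic from the font name is only a heuristic assumption: PDF fonts are not required to name their style this way.

```python
from collections import Counter


def style_from_fontname(fontname: str) -> dict:
    """Guess bold/italic from a PDF font name.

    Heuristic only (an assumption, not a pdfminer feature): many fonts
    encode the style in the name, e.g. "Times-BoldItalic".
    """
    # Subset fonts carry a prefix such as "ABCDEF+"; drop it before matching.
    base = fontname.split("+", 1)[-1].lower()
    return {
        "bold": "bold" in base,
        "italic": "italic" in base or "oblique" in base,
    }


def font_usage(pdf_path: str) -> Counter:
    """Count characters per (fontname, rounded size) in a PDF."""
    # Imported lazily so style_from_fontname works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    counts = Counter()
    for page in extract_pages(pdf_path):
        for block in page:
            if not isinstance(block, LTTextContainer):
                continue  # skip figures, lines, rectangles, etc.
            for line in block:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTAnno objects (inserted spaces/newlines) carry no font info.
                    if isinstance(obj, LTChar):
                        counts[(obj.fontname, round(obj.size, 1))] += 1
    return counts
```

Running `font_usage("example-docs/layout-parser-paper.pdf")` and passing each font name through `style_from_fontname` gives a rough picture of which fonts in the document look bold or italic.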

Mar 28 '24 05:03 LunaticMaestro
