bug/skipping-figures

Open • joelgwebber opened this issue 5 months ago • 1 comment

Describe the bug

Perhaps on the cusp between bug and feature. When parsing HTML pages, I was surprised to find that any subtree wrapped in a <figure> is silently removed by partition_html(). Common cases include just about every Wikipedia article; these often contain useful image URLs and text descriptions in <figure>s.

I haven't dug in much further, but from a quick examination of the code, it looks like this may extend to other less-common element types.

To Reproduce

from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")
def find(text: str):
    # Print the first partitioned element whose text contains `text`.
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")

find("loose collection of ideas") # finds this in the initial paragraph
find("minor as upside down major") # can't find this because it's buried in a figure

Expected behavior

That the <figure> contents would either be found by default, or that an option would control which elements are skipped.
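To make concrete what kind of content goes missing, here is a minimal sketch that uses only Python's standard-library html.parser. It does not use or reflect unstructured's internals; FigureCollector, collect_figures, and SAMPLE are hypothetical names, and the HTML snippet and image URL are made up for illustration.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect image URLs and text found inside <figure> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth of currently open <figure> tags
        self.figures = []  # one dict per top-level <figure>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            if self.depth == 0:
                self.figures.append({"images": [], "text": []})
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.figures[-1]["images"].append(src)

    def handle_endtag(self, tag):
        if tag == "figure" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text that appears inside a figure.
        if self.depth > 0 and data.strip():
            self.figures[-1]["text"].append(data.strip())


def collect_figures(html: str) -> list:
    """Return a list of {"images": [...], "text": [...]} dicts, one per figure."""
    parser = FigureCollector()
    parser.feed(html)
    parser.close()
    return parser.figures


# Made-up sample page: one paragraph and one figure with a caption.
SAMPLE = """
<p>A loose collection of ideas.</p>
<figure>
  <img src="https://example.org/tonnetz.png">
  <figcaption>Minor as upside down major</figcaption>
</figure>
"""

for fig in collect_figures(SAMPLE):
    print(fig)
```

The image URL and the caption text collected here are the sort of thing I'd expect to be reachable from the partitioned elements, either by default or behind an option.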

Environment Info

I don't have a local build going yet, but I promise it's a trivial repro in any environment.

joelgwebber • Sep 08 '24 16:09