unstructured
bug/skipping-figures
Describe the bug
This is perhaps on the cusp between bug and feature request. When parsing HTML pages, I found it surprising that any sub-tree wrapped in a <figure> is silently removed in partition_html(). Common cases include just about every Wikipedia article; these often contain useful image URLs and text descriptions in <figure>s.
I haven't dug in much further, but from a quick examination of the code, it looks like this may extend to other less-common element types.
To Reproduce
from unstructured.partition.html import partition_html

elems = partition_html(url="https://en.wikipedia.org/wiki/Neo-Riemannian_theory")

def find(text: str):
    # Print the first parsed element whose text contains the given string.
    for elem in elems:
        if text in elem.text:
            print("found it:\n", elem)
            return
    print("nope")

find("loose collection of ideas")  # finds this in the initial paragraph
find("minor as upside down major")  # can't find this because it's buried in a figure
Expected behavior
That the <figure> contents would either be found by default, or that there would be an option controlling which elements to skip.
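To make concrete what I would hope to get back, here is a minimal sketch of pulling the image URLs and text out of <figure> elements. It uses only Python's standard-library html.parser and is not part of unstructured; the names FigureExtractor and extract_figures are hypothetical, and I am assuming that nested figures can be folded into their outermost figure.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect image URLs and text found inside <figure> elements.

    Hypothetical helper for illustration; not an unstructured API.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of currently open <figure> tags
        self.figures = []   # one entry per outermost <figure>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            if self.depth == 0:
                self.figures.append({"images": [], "text": []})
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.figures[-1]["images"].append(src)

    def handle_endtag(self, tag):
        if tag == "figure" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text that appears somewhere inside a <figure>.
        if self.depth > 0 and data.strip():
            self.figures[-1]["text"].append(data.strip())


def extract_figures(html: str):
    """Return a list of {"images": [...], "text": "..."} dicts, one per figure."""
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"images": fig["images"], "text": " ".join(fig["text"])}
        for fig in parser.figures
    ]
```

Calling extract_figures() on the raw page HTML and merging the results with the output of partition_html() might serve as a stopgap, though a built-in option would obviously be preferable.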
Environment Info
I don't have a local build going yet, but I promise it's a trivial repro in any environment.