feat/parse_html_embed_objects

Open My3VM opened this issue 2 years ago • 3 comments

I am trying to parse HTML documents containing embedded images and YouTube videos inside iframes. Using the partition_html function, I can get the textual elements as well as the metadata object containing the links from <a href> tags. However, the image elements and iframe elements are being missed.

I would like these data points to be made available either as separate elements, such as HTMLImage and HTMLIframe, or by adding their URLs to the metadata object's link_urls.

My3VM, Dec 07 '23 16:12

@scanny - What do you think about this? I think I'd rather avoid dynamically linked videos or images in HTML files. For images at least, converting the HTML to PDF could work to extract the images. I don't think we're likely to do anything with iframes.

MthwRobinson, Jun 13 '24 13:06

tl;dr: We could potentially capture those links but probably not traverse them to actually capture the image or video bytes.

<img>

It has crossed my mind that we could treat <img> as something like a special case of <a> and capture the image URL as metadata. One challenge is that <img> can contain no text, so we'd need to use a placeholder like "image" or maybe the image alt-text when present for the .metadata.link_text field in the document-element.
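To make the idea concrete, here is a minimal sketch of capturing <img> URLs as link-style metadata, with the alt-text used as the link text and "image" as the fallback placeholder. It uses only Python's standard-library html.parser, not unstructured's own internals; the names ImageLinkCollector and collect_image_links and the dict layout are hypothetical, chosen just for illustration.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collect a text/url pair for every <img> that has a src attribute."""

    def __init__(self, placeholder="image"):
        super().__init__()
        self.placeholder = placeholder  # used when there is no alt-text
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> via handle_startendtag.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # nothing to link to
        alt = (attr_map.get("alt") or "").strip()
        self.links.append({"text": alt or self.placeholder, "url": src})


def collect_image_links(html):
    """Hypothetical helper: return the image links found in an HTML string."""
    parser = ImageLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p>Hi <img src="a.png" alt="A cat"> and <img src="b.png"/></p>'
    for link in collect_image_links(sample):
        print(link["text"], link["url"])
```

Note that this only records the URL; it never fetches the image, so the malicious-content concern does not come into play.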

Traversing the link and downloading the image is something we might consider at some point, possibly in "hi_res" mode. The key concern there would be avoiding malicious content, which is a non-trivial extra engineering effort and probably still a risk no matter what you do to avoid it.

<iframe>

An <iframe> is essentially a link to another web-page that then gets fetched by the browser and displayed in the "frame". Very similar to <img> except a whole HTML page.

I agree that recursively fetching <iframe> web pages and processing them to elements is probably not something we're going to want to support anytime soon. Top of mind for me there would also be the risk of malicious content.

We could extract the link as some sort of metadata, but because <iframe> is empty (that HTML-element can contain no content) there would be no text and therefore no unstructured document-Element to attach that metadata to. So that would require some noodling. We'd need to add a "fake" element or something to go down that route.
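One way the "fake" element route might look, again as a sketch built on the standard-library html.parser rather than on unstructured itself: each <iframe> with a src becomes a placeholder record whose text is the title attribute when present, or the literal string "iframe" otherwise. PlaceholderElement, IframeCollector, and collect_iframe_placeholders are hypothetical names, and using title as the text is an assumption, not something the library does.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class PlaceholderElement:
    """Hypothetical stand-in for a document element that has no real text."""

    text: str
    link_url: str


class IframeCollector(HTMLParser):
    """Emit one placeholder per <iframe> that has a src; fetch nothing."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # no URL to record
        title = (attr_map.get("title") or "").strip()
        self.elements.append(
            PlaceholderElement(text=title or "iframe", link_url=src)
        )


def collect_iframe_placeholders(html):
    """Hypothetical helper: return placeholder elements for the iframes."""
    parser = IframeCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = '<iframe src="https://example.com/widget" title="Demo"></iframe>'
    for element in collect_iframe_placeholders(sample):
        print(element.text, element.link_url)
```

The framed page is never requested here, so this stays on the "capture the link, don't traverse it" side of the line.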

scanny, Jun 13 '24 17:06

Yeah, downloading malicious content from the link was my main concern as well. I like the idea of treating <img> similarly to links and pulling out the URL. Let's keep this one open and we can consider doing that.

MthwRobinson, Jun 13 '24 17:06