Add Images
@addie9800 Thanks for pushing this idea 👍 I think this would be an awesome addition to Fundus! 🚀
It looks like you accidentally pushed a lot of unrelated files to the draft, making it harder for me to focus on the core idea. Would you mind removing those files so we can talk about the changes?
Ah well, sorry about that. I only intended to push one extra file :)... I cleaned up a bit now, in case you want to have a look, but I haven't reached a real milestone since you last had a peek at it. The issue I am struggling with most at the moment is the dynamic rescaling of images: some publishers change the URL path according to the resolution needed for the given screen, which makes it difficult to come up with a selector for the corresponding `img` element. If you have any ideas, shoot ;)
Here are some examples of problematic cases:
- The file name is changed to reflect the resolution and width of an image: the article https://www.spiegel.de/netzwelt/apps/instagram-beschraenkt-accounts-von-teenagern-a-3b65f364-e98b-4a31-8cfc-2d7ec4fc19bf#ref=rss has three versions of the same image in its JSON: https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.778_fpx70_fpy50.99.jpg , https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.33_fpx70_fpy50.99.jpg and https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1_fpx70_fpy50.99.jpg . Yet none of these links is actually in an `img` element on the website. In my browser the actual link is: https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w960_r1.778_fpx70_fpy50.99.jpg
- Similarly, some publishers change the path to the image based on the image resolution: in this article https://www.welt.de/politik/deutschland/article253552426/Christian-Lindners-Steuerplaene-entlasten-Gutverdiener-am-staerksten.html there is one image in the JSON: https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci16x9-w1200/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg but in my browser, this image is used: https://img.welt.de/img/politik/deutschland/mobile253552538/4727938867-ci23x11-w20/Session-of-the-lower-house-of-German-parliament-Bundestag-in-Ber.jpg
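Not part of the draft, but one direction that might help here: normalize the resolution-dependent tokens out of the URL before comparing, so that differently scaled variants of the same image collapse to one key. A rough sketch in Python, assuming the width/ratio segments follow the patterns seen in the two examples above (the helper name and the regex are made up):

```python
import re

# Strip resolution-dependent tokens such as "_w1200" / "_r1.778" (Spiegel-style)
# or "ci16x9-w1200" (Welt-style) from an image URL, so that two rescaled
# variants of the same image compare as equal. Purely illustrative.
_RESOLUTION_TOKENS = re.compile(r"_w\d+|_r[\d.]+|ci\d+x\d+-w\d+")

def normalize_image_url(url: str) -> str:
    return _RESOLUTION_TOKENS.sub("", url)

# The JSON variant and the one actually used in the browser collapse to the same key:
json_url = "https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w1200_r1.778_fpx70_fpy50.99.jpg"
browser_url = "https://cdn.prod.www.spiegel.de/images/50b64eef-9d82-4021-93dc-732207c263cc_w960_r1.778_fpx70_fpy50.99.jpg"
assert normalize_image_url(json_url) == normalize_image_url(browser_url)
```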
Update: As of now, I have verified the functionality for TheNamibian, DerStandard, ORF, NineNews and CBCNews. They can be used to get an impression of the intended functionality.
@addie9800 You can eliminate the URL parameter using the following selector on the HTML file.
url_selector = XPath("//meta[@property='og:url']/@content")
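Just for context, a minimal sketch of how that selector could be applied with lxml (the inline HTML snippet is made up; in Fundus this would run on the parsed article page instead):

```python
from lxml import html
from lxml.etree import XPath

url_selector = XPath("//meta[@property='og:url']/@content")

# Toy example: extract the canonical article URL from the og:url meta tag.
html_source = '<html><head><meta property="og:url" content="https://example.com/article"/></head></html>'
matches = url_selector(html.fromstring(html_source))
canonical_url = matches[0] if matches else None  # -> "https://example.com/article"
```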
Further, I would like to hear your opinion on this:
With this implementation, we get the matching `ImageObject` from the ld for every image candidate we identified, as long as there is an ImageObject present at all 😅. Do you think it is worth the additional effort? Or is taking the information solely from the HTML sufficient?
Implementation
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

from lxml.etree import XPath

# Note: Image, LinkedDataMapping and JSONVal are Fundus-internal types.


def get_most_similar_url(url: str, candidates: List[str]) -> Tuple[str, float]:
    # Return the candidate most similar to <url> together with its similarity ratio.
    best: Tuple[str, float] = "init", 0
    for candidate in candidates:
        ratio = SequenceMatcher(None, url, candidate).ratio()
        if ratio > best[1]:
            best = candidate, ratio
    return best


def get_image_objects_from_ld(images: List[Image], ld: LinkedDataMapping) -> List[Optional[Dict[str, JSONVal]]]:
    # For every image candidate, look up the most similar ImageObject within the linked data, if any.
    objects = []
    urls = ld.xpath_search(XPath("//*[U0040type[text()='ImageObject']]/*[contains(name(), 'url')]"))
    for image in images:
        similar = get_most_similar_url(image.urls[0], urls)
        if similar[1] >= 0.8:
            similar_url = similar[0]
            query = f"//*[U0040type[text()='ImageObject'] and */text()='{similar_url}']"
            image_objects = ld.xpath_search(XPath(query))
            if len(image_objects) > 1:
                # TODO: decide which object to take
                objects.append(image_objects[0])
            else:
                objects.extend(image_objects)
        else:
            objects.append(None)
    return objects
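For what it's worth, the matching heuristic can be sanity-checked on its own; the URLs below are made up:

```python
# Hypothetical URLs: two candidates from the linked data, one <img> URL from the HTML.
candidates = [
    "https://cdn.example.com/images/abc_w1200_r1.778.jpg",
    "https://cdn.example.com/images/xyz_w1200_r1.778.jpg",
]
html_url = "https://cdn.example.com/images/abc_w960_r1.778.jpg"

best_url, ratio = get_most_similar_url(html_url, candidates)
# best_url == candidates[0]; ratio is well above the 0.8 threshold used above
```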
Another idea that came to my mind while looking at the `similarity_threshold`: Wouldn't it be sufficient to use the bounds within `load_images_from_html` as well? Wouldn't that replace the entire merging part, or at least reduce the number of potential candidates a lot?
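Very roughly, I am thinking of something along these lines (names are made up, this is not what the draft currently does; it reuses `get_most_similar_url` from the snippet above):

```python
from typing import List

def filter_candidates(candidate_urls: List[str], reference_urls: List[str],
                      similarity_threshold: float = 0.8) -> List[str]:
    # Keep only those <img> candidates that are close enough to a known reference URL,
    # so the bound is applied while collecting candidates rather than in a later merge step.
    kept = []
    for candidate in candidate_urls:
        _, ratio = get_most_similar_url(candidate, reference_urls)
        if ratio >= similarity_threshold:
            kept.append(candidate)
    return kept
```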
> @addie9800 You can eliminate the URL parameter using the following selector on the HTML file.

Thanks, I changed that.
> Further, I would like to hear your opinion on this: With this implementation, we get the matching `ImageObject` from the ld for every image candidate we identified, as long as there is an ImageObject present at all 😅. Do you think it is worth the additional effort? Or is taking the information solely from the HTML sufficient?
I think this ("as long as there is an ImageObject present at all") is the crucial point. From what I've seen, it's too much of a hassle to actually find the ImageObjects within the JSON, because many publishers have slightly different implementations. Then, not all images in there are necessarily relevant, or have the same URL as the images actually used in the HTML. Also, a lot of publishers don't include any extra information; some just list image URLs. Lastly, these lists may also be incomplete, containing only the cover images. Long story short - I think it's too much extra work for probably no(t much) gain, since all the info from the JSON, at least for the publishers we have integrated so far, is also (easily) available in the HTML.
> Another idea that came to my mind while looking at the `similarity_threshold`: Wouldn't it be sufficient to use the bounds within `load_images_from_html` as well? Wouldn't that replace the entire merging part, or at least reduce the number of potential candidates a lot?
You might be right about that. I think a good next step would be to integrate `merge_images` into the `load_images_from_html` function. It's mostly still a relic from when we sourced the images from both the JSON and the HTML and had to map them together.