gofeed

Image always nil

Open dewsy opened this issue 5 years ago • 10 comments

Expected behavior

Item.Image returns string

Actual behavior

Item.Image is nil

Steps to reproduce the behavior

I tried ~2000 articles from 279 different RSS feeds (all of them were XML), and every one of the 2000 gofeed.Item values had nil for Image. I looked at the XML feeds and there were image links in them, so it must be a parsing error.

dewsy avatar Dec 04 '19 05:12 dewsy

@dewsy could you share links to a few RSS feeds where you were unable to get this to work?

sudhanshuraheja avatar Jun 16 '20 18:06 sudhanshuraheja

Similar problem here,

https://businessdesk.co.nz/feed

Not getting any image data and also another problem I'm seeing with this feed is I get a strange value for Author like this:

author: { email: "<name>Some Name</name><email>SomeEmail</email>" }

arrow-dev avatar Jul 08 '20 22:07 arrow-dev

Thanks @arrow-dev will try it out this weekend.

sudhanshuraheja avatar Jul 09 '20 02:07 sudhanshuraheja

So, there are two issues here.

Author

The author doesn't get parsed properly because the RSS spec expects <author>[email protected] (Lawyer Boyer)</author>

but the XML contains

<author>
    <name>Howard</name>
    <email>[email protected]</email>
</author>
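
As a rough illustration of handling both author shapes with the standard library, here is a minimal sketch. It assumes a hypothetical helper named parseAuthor (not part of gofeed), and the email addresses in it are invented placeholders. The struct captures nested name/email children as well as the plain text form, then falls back to splitting "email (Name)".

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// author accepts both shapes seen in this thread: plain text content
// ("email (Name)") and nested <name>/<email> children.
type author struct {
	Name  string `xml:"name"`
	Email string `xml:"email"`
	Text  string `xml:",chardata"`
}

// parseAuthor is a hypothetical helper for illustration only.
// It returns the name and email found in a single <author> element.
func parseAuthor(fragment string) (name, email string, err error) {
	var a author
	if err = xml.Unmarshal([]byte(fragment), &a); err != nil {
		return "", "", err
	}
	// Nested form: <author><name>..</name><email>..</email></author>.
	if a.Name != "" || a.Email != "" {
		return strings.TrimSpace(a.Name), strings.TrimSpace(a.Email), nil
	}
	// Plain text form: "email (Name)".
	text := strings.TrimSpace(a.Text)
	if open := strings.Index(text, "("); open >= 0 && strings.HasSuffix(text, ")") {
		return strings.TrimSpace(text[open+1 : len(text)-1]), strings.TrimSpace(text[:open]), nil
	}
	// No parenthesised name: treat the whole text as the email.
	return "", text, nil
}

func main() {
	// Placeholder inputs; the addresses are made up for the example.
	inputs := []string{
		"<author>\n    <name>Howard</name>\n    <email>howard@example.com</email>\n</author>",
		"<author>jane@example.com (Jane Doe)</author>",
	}
	for _, in := range inputs {
		name, email, err := parseAuthor(in)
		fmt.Println(name, email, err)
	}
}
```

This only covers the two forms discussed here; feeds with other author layouts would need more cases.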

Image Data

I can see an image tag inside items, like so

<image>
    <url>https://media.businessdesk.co.nz/file/c_fill,w_330/Invercargill-Dee-St.jpg</url>
    <title>Tiwai closure: Southlanders look for options</title>
    <link>https://businessdesk.co.nz/article/tiwai-closure-southlanders-look-for-options</link>
</image>

Though, what I find strange is that the RSS spec doesn't have image as a tag inside items.

I believe this is why the image is missed: for RSS, the default mappings for gofeed.Item.Image use only /rss/channel/item/itunes:image and /rss/channel/item/media:image, and there are none for Atom.

Possible Solutions

Overall, I'm not sure about raising a PR in GoFeed for parsing elements that are not in the spec. Here's the documentation about using a custom parser

sudhanshuraheja avatar Jul 09 '20 02:07 sudhanshuraheja

@sudhanshuraheja Thanks for that info, first time dealing with RSS. With the Author it looked like it was lining up with the struct in GoFeed, and the same with Image. It looks like this feed is not formatted according to the spec, so I will need to write a custom parser. I agree there would be no need for a PR to parse elements that are not in the spec.

arrow-dev avatar Jul 09 '20 04:07 arrow-dev

@arrow-dev Did you end up writing a custom parser? I too would like to be able to extract images from various RSS/Atom feeds for https://feeds.twtxt.net/ (sigh no one follows the spec!)

prologic avatar Oct 15 '20 06:10 prologic

@prologic Actually just ended up using the standard library as shown here

arrow-dev avatar Oct 15 '20 09:10 arrow-dev

@prologic Actually just ended up using the standard library as shown here

Can you share your code? I don't really feel like reinventing Atom/RSS parsing for what amounts to similar problems you had with gofeed :)

prologic avatar Oct 15 '20 22:10 prologic

@prologic I'm not able to share the code but basically it is exactly like in that linked example from godoc,

  • define type(s)/struct(s) reflecting the fields that you want to pull out, in the same shape as the feed/XML, with struct tags that map the XML element names and/or attributes

  • instantiate a variable with the empty/default value of that struct

  • call xml.Unmarshal(xml feed as bytes, pointer to the var to populate with data) and you should end up with the struct being populated
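
A minimal sketch of those three steps, for a feed that puts an image element inside each item. The type names (rssDoc, rssItem), the parseItems helper, and the sample XML are all made up for illustration; only encoding/xml from the standard library is used.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Step 1: structs in the same shape as the feed, with struct tags
// naming the XML elements to pull out. Unlisted elements are ignored.
type rssDoc struct {
	Channel struct {
		Items []rssItem `xml:"item"`
	} `xml:"channel"`
}

type rssItem struct {
	Title string `xml:"title"`
	Image struct {
		URL string `xml:"url"`
	} `xml:"image"`
}

// sample is an invented feed fragment, not taken from a real site.
const sample = `<rss version="2.0"><channel>
  <item>
    <title>Example item</title>
    <image>
      <url>https://example.com/a.jpg</url>
      <title>Example item</title>
    </image>
  </item>
</channel></rss>`

// parseItems is a hypothetical helper covering steps 2 and 3:
// declare an empty value, then let xml.Unmarshal populate it.
func parseItems(data []byte) ([]rssItem, error) {
	var doc rssDoc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return doc.Channel.Items, nil
}

func main() {
	items, err := parseItems([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.Title, it.Image.URL)
	}
}
```

Items without an image element simply end up with an empty Image.URL, so callers would check for the empty string.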

arrow-dev avatar Oct 16 '20 01:10 arrow-dev

It seems images are all over the place. I'm seeing:

  • extensions.media.content where attrs.medium = 'image' or attrs.type = 'image/*' (the gmail blog, which is another example of the author name being wrapped in more XML tags, and xkcd)
  • enclosures where type = 'image/*' (androidpolice)
  • the only image is the first <img> in content. (Ars, fivethirtyeight, marco.org)
  • the only image is the first <img> in description

Would gofeed devs accept a patch that does a best-effort search in these locations (others?) to fill in the Image field? Any arbitrary search order is ok with me, it would just be nice to have any preview thumbnail for a bare bones rss reader that I'm building without having to break out the parsers and query engines.
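
For the enclosure case in that list, the check could look roughly like the sketch below. The enclosure type here is a local stand-in so the example compiles without the library. As far as I know gofeed exposes item.Enclosures with URL and Type fields, but treat that mapping as an assumption. firstImageEnclosure is a hypothetical name, and the URLs are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// enclosure is a local stand-in holding only the two fields this
// sketch needs; it is not the gofeed type.
type enclosure struct {
	URL  string
	Type string
}

// firstImageEnclosure returns the URL of the first enclosure whose
// MIME type starts with "image/", or "" if there is none.
func firstImageEnclosure(encs []enclosure) string {
	for _, e := range encs {
		if e.URL != "" && strings.HasPrefix(strings.ToLower(e.Type), "image/") {
			return e.URL
		}
	}
	return ""
}

func main() {
	// Placeholder data for the example.
	encs := []enclosure{
		{URL: "https://example.com/episode.mp3", Type: "audio/mpeg"},
		{URL: "https://example.com/thumb.png", Type: "image/png"},
	}
	fmt.Println(firstImageEnclosure(encs))
}
```

A prefix match on "image/" is a bit stricter than a substring match on "image", so it would not pick up types that merely contain the word elsewhere.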

infogulch avatar Aug 30 '23 04:08 infogulch

+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set.

Would love to get some more best-effort in for getting images.

(PS as someone who maintains an RSS parser in JS, I know it's a slog. Thanks for the hard work here)

rbren avatar Feb 19 '24 15:02 rbren

Here's a workaround that seems to capture most images:

func getImageFromExtensions(item gofeed.Item) string {
    if media, ok := item.Extensions["media"]; ok {
        if content, ok := media["content"]; ok {
            for _, c := range content {
                if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
                    return c.Attrs["url"]
                }
            }
        }
    }
    return ""
}

Edit: to capture HTML images, like XKCD, here's a more complex pass using "golang.org/x/net/html"

import (
    "bytes" // needed for bytes.NewBufferString in findImageInHTML
    "strings"

    "golang.org/x/net/html"
    "github.com/mmcdole/gofeed"
)

func getImageFromExtensions(item gofeed.Item) string {
        if media, ok := item.Extensions["media"]; ok {
                if content, ok := media["content"]; ok {
                        for _, c := range content {
                                if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
                                        return c.Attrs["url"]
                                }
                        }
                }
        }
        docImg := findImageInHTML(item.Description)
        if docImg != "" {
                return docImg
        }
        return ""
}

func findImageInHTML(contents string) string {
        doc, err := html.Parse(bytes.NewBufferString(contents))
        if err != nil {
                return ""
        }
        return findImageInHTMLNode(doc)
}

func findImageInHTMLNode(n *html.Node) string {
        if n.Type == html.ElementNode && n.Data == "img" {
                for _, a := range n.Attr {
                        if a.Key == "src" {
                                return a.Val
                        }
                }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
                if img := findImageInHTMLNode(c); img != "" {
                        return img
                }
        }
        return ""
}

rbren avatar Feb 19 '24 16:02 rbren

+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set.

Would love to get some more best-effort in for getting images.

I agree.

Some of the existing "best-effort" code, which extracts data from extensions, is in the translator file if you look at the DefaultRSSTranslator. This file converts the RSS object to the unified gofeed format. You can see that we pull some fields out of the iTunesExt object, etc. This approach is convenient because it ensures that the RSS feed and item content are fully parsed before digging into them for various fields in different locations.

[RSS Parser] > [RSS Feed] > [DefaultRSSTranslator] > [Gofeed Obj]

The question is whether we should add this functionality there, or modify rss/parser.go to perform best-effort image parsing, even when using the RSS parser directly, instead of the universal gofeed parser.

I lean towards leaving the RSS / Atom / other parsers to parse the "raw" feed exactly as it is represented, and keeping quality-of-life features, like best-effort parsing of various deeply nested fields, in the universal gofeed layer, but I'm not totally sure.

Any thoughts @rbren ?

mmcdole avatar Feb 20 '24 05:02 mmcdole