gofeed
Image always nil
Expected behavior
Item.Image returns string
Actual behavior
Item.Image is nil
Steps to reproduce the behavior
I tried ~2000 articles from 279 different RSS feeds (all of them XML), and every one of the 2000 gofeed.Item values had nil for Image. I looked at the XML feeds and there were image links in them, so it must be a parsing error.
@dewsy could you share links to a few RSS feeds where you were unable to get this to work?
Similar problem here,
https://businessdesk.co.nz/feed
I'm not getting any image data. Another problem I'm seeing with this feed is a strange value for Author, like this:
author: { email: "<name>Some Name</name><email>SomeEmail</email>" }
Thanks @arrow-dev will try it out this weekend.
So, there are two issues here.
Author
The author doesn't get parsed properly because the RSS spec expects
<author>[email protected] (Lawyer Boyer)</author>
but the XML contains
<author>
<name>Howard</name>
<email>[email protected]</email>
</author>
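For anyone hitting the same Author problem, here is a minimal sketch of a post-processing workaround using only the standard library. It assumes the raw nested markup ends up in the author's email field exactly as reported above; `splitNestedAuthor` and `nestedAuthor` are hypothetical names, not part of gofeed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// nestedAuthor mirrors the non-standard <author> body seen in the feed:
// <name>...</name><email>...</email>.
type nestedAuthor struct {
	Name  string `xml:"name"`
	Email string `xml:"email"`
}

// splitNestedAuthor (hypothetical helper) takes the raw string that landed in
// the author's email field and tries to recover the name and email from it.
// ok is false when the value contains no markup or is not well-formed XML.
func splitNestedAuthor(raw string) (name, email string, ok bool) {
	if !strings.Contains(raw, "<") {
		return "", "", false // looks like a plain spec-style author string
	}
	var a nestedAuthor
	// Wrap the fragment in a single root element so it is a valid XML document.
	if err := xml.Unmarshal([]byte("<author>"+raw+"</author>"), &a); err != nil {
		return "", "", false
	}
	return strings.TrimSpace(a.Name), strings.TrimSpace(a.Email), true
}

func main() {
	name, email, ok := splitNestedAuthor("<name>Some Name</name><email>SomeEmail</email>")
	fmt.Println(name, email, ok) // prints: Some Name SomeEmail true
}
```

When `ok` is false the caller can keep whatever gofeed already produced, so spec-compliant feeds are unaffected.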
Image Data
I can see an image tag inside items, like so
<image>
<url>https://media.businessdesk.co.nz/file/c_fill,w_330/Invercargill-Dee-St.jpg</url>
<title>Tiwai closure: Southlanders look for options</title>
<link>https://businessdesk.co.nz/article/tiwai-closure-southlanders-look-for-options</link>
</image>
What I find strange, though, is that the RSS spec doesn't define image as a tag inside items.
I believe this is why the default mappings for the image of a gofeed.Item use only /rss/channel/item/itunes:image and /rss/channel/item/media:image for RSS, and none for Atom.
Possible Solutions
Overall, I'm not sure about raising a PR in GoFeed for parsing elements that are not in the spec. Here's the documentation about using a custom parser
@sudhanshuraheja Thanks for that info, first time dealing with RSS. With the Author it looked like it was lining up with the struct in GoFeed and the same with Image. Looks like this feed is not correctly formatted to the spec and I will need to do a custom parser. - I agree there would be no need for a PR to parse elements not in the spec.
@arrow-dev Did you end up writing a custom parser? I too would like to be able to extract images from various RSS/Atom feeds for https://feeds.twtxt.net/ (sigh no one follows the spec!)
@prologic Actually just ended up using the standard library as shown here
Can you share your code? I don't really feel like reinventing Atom/RSS parsing for what amounts to similar problems you had with gofeed :)
@prologic I'm not able to share the code but basically it is exactly like in that linked example from godoc:
- define type(s)/struct(s) reflecting the fields that you want to pull out, in the same shape as the feed/xml with meta tags to map the xml field names and/or attributes
- instantiate a variable with the empty/default value of that struct
- call xml.Unmarshal(xml feed as bytes, pointer to the var to populate with data) and you should end up with the struct being populated
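The three steps above can be sketched roughly as follows. This is only an illustration, not the code that could not be shared: the struct names (`rssDoc`, `rssItem`), the `parseFeed` helper, and the trimmed `sample` document are assumptions, shaped after the businessdesk item quoted earlier in the thread.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Step 1: structs in the same shape as the feed, with xml tags mapping fields
// to element names. Only the fields we care about are declared.
type rssDoc struct {
	Channel struct {
		Title string    `xml:"title"`
		Items []rssItem `xml:"item"`
	} `xml:"channel"`
}

type rssItem struct {
	Title  string `xml:"title"`
	Link   string `xml:"link"`
	Author struct {
		Name  string `xml:"name"`
		Email string `xml:"email"`
	} `xml:"author"` // nested, non-standard author
	Image struct {
		URL string `xml:"url"`
	} `xml:"image"` // item-level image, not in the RSS spec
}

// parseFeed covers steps 2 and 3: start from an empty value and let
// xml.Unmarshal populate it.
func parseFeed(data []byte) (*rssDoc, error) {
	var doc rssDoc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}

// sample is a cut-down, illustrative document, not the real feed.
const sample = `<rss version="2.0"><channel><title>Example</title>
<item>
  <title>Tiwai closure: Southlanders look for options</title>
  <link>https://businessdesk.co.nz/article/tiwai-closure-southlanders-look-for-options</link>
  <author><name>Howard</name></author>
  <image><url>https://media.businessdesk.co.nz/file/c_fill,w_330/Invercargill-Dee-St.jpg</url></image>
</item>
</channel></rss>`

func main() {
	doc, err := parseFeed([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range doc.Channel.Items {
		fmt.Println(it.Title, "|", it.Author.Name, "|", it.Image.URL)
	}
}
```

The trade-off is that every feed shape you want to support needs its own struct tags, which is exactly the per-feed work gofeed normally saves you.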
It seems images are all over the place. I'm seeing:
- extensions.media.content where attrs.medium = 'image' or attrs.type = 'image/*' (gmail blog, another example of surrounding the author name with more xml tags; xkcd)
- enclosures where type = 'image/*' (androidpolice)
- the only image is the first <img> in content (Ars, fivethirtyeight, marco.org)
- the only image is the first <img> in description
Would gofeed devs accept a patch that does a best-effort search in these locations (others?) to fill in the Image field? Any arbitrary search order is ok with me, it would just be nice to have any preview thumbnail for a bare bones rss reader that I'm building without having to break out the parsers and query engines.
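To make the proposed search order concrete, here is a rough standard-library sketch. The `item` and `media` types are simplified, hypothetical stand-ins for the parsed data (they are not gofeed types), and the regular expression is a deliberately crude way to find the first `<img>`; a real HTML parser would be more robust.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// media and item are hypothetical stand-ins for parsed feed data.
type media struct {
	URL, Type, Medium string
}

type item struct {
	MediaContent []media // media:content elements
	Enclosures   []media // <enclosure> elements
	Content      string  // full HTML content
	Description  string  // HTML description
}

// imgSrc crudely matches the src attribute of the first <img> tag.
var imgSrc = regexp.MustCompile(`(?i)<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']`)

func firstImgSrc(html string) string {
	if m := imgSrc.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return ""
}

// isImage reports whether a media entry is declared as an image.
func isImage(m media) bool {
	return m.URL != "" && (m.Medium == "image" || strings.HasPrefix(m.Type, "image/"))
}

// bestEffortImage tries each location in turn and returns the first hit:
// media:content, then enclosures, then <img> in content, then in description.
func bestEffortImage(it item) string {
	for _, m := range it.MediaContent {
		if isImage(m) {
			return m.URL
		}
	}
	for _, e := range it.Enclosures {
		if isImage(e) {
			return e.URL
		}
	}
	if src := firstImgSrc(it.Content); src != "" {
		return src
	}
	return firstImgSrc(it.Description)
}

func main() {
	it := item{Description: `<p>Comic</p><img alt="x" src="https://example.com/c.gif">`}
	fmt.Println(bestEffortImage(it))
}
```

The order here is arbitrary, as the question suggests; explicit metadata is tried before scraping HTML only because it is cheaper and less likely to pick up a tracking pixel.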
+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set.
Would love to get some more best-effort in for getting images.
(PS as someone who maintains an RSS parser in JS, I know it's a slog. Thanks for the hard work here)
Here's a workaround that seems to capture most images:
// getImageFromExtensions returns the URL of the first media:content
// extension that is declared as an image, or "" if there is none.
func getImageFromExtensions(item gofeed.Item) string {
	if media, ok := item.Extensions["media"]; ok {
		if content, ok := media["content"]; ok {
			for _, c := range content {
				if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
					return c.Attrs["url"]
				}
			}
		}
	}
	return ""
}
Edit: to capture HTML images, like XKCD, here's a more complex pass using "golang.org/x/net/html"
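import (
	"strings"

	"github.com/mmcdole/gofeed"
	"golang.org/x/net/html"
)

// getImageFromExtensions returns the first media:content image URL, falling
// back to the first <img> found in the item's HTML description.
func getImageFromExtensions(item gofeed.Item) string {
	if media, ok := item.Extensions["media"]; ok {
		if content, ok := media["content"]; ok {
			for _, c := range content {
				if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
					return c.Attrs["url"]
				}
			}
		}
	}
	// No media:content image; fall back to the description HTML.
	docImg := findImageInHTML(item.Description)
	if docImg != "" {
		return docImg
	}
	return ""
}

// findImageInHTML parses an HTML fragment and returns the src of its first
// <img> element, or "" if parsing fails or no image is present.
func findImageInHTML(contents string) string {
	// strings.NewReader avoids needing the "bytes" package, which the
	// original snippet used without importing.
	doc, err := html.Parse(strings.NewReader(contents))
	if err != nil {
		return ""
	}
	return findImageInHTMLNode(doc)
}

// findImageInHTMLNode walks the node tree depth-first looking for an <img>.
func findImageInHTMLNode(n *html.Node) string {
	if n.Type == html.ElementNode && n.Data == "img" {
		for _, a := range n.Attr {
			if a.Key == "src" {
				return a.Val
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if img := findImageInHTMLNode(c); img != "" {
			return img
		}
	}
	return ""
}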
+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set. Would love to get some more best-effort in for getting images.
I agree.
Some of the existing "best-effort" code, which extracts data from extensions, is in the translator file if you look at the DefaultRSSTranslator. This file converts the RSS object to the unified gofeed format. You can see that we pull some fields out of the iTunesExt object, etc. This approach is convenient because it ensures that the RSS feed and item content are fully parsed before digging into them for various fields in different locations.
[RSS Parser] > [RSS Feed] > [DefaultRSSTranslator] > [Gofeed Obj]
The question is whether we should add this functionality there, or modify rss/parser.go to perform best-effort image parsing, even when using the RSS parser directly, instead of the universal gofeed parser.
I lean towards having the RSS / Atom / other parsers parse the "raw" feed exactly as it is represented, while the universal gofeed layer provides quality-of-life features such as best-effort parsing of various deeply nested fields, but I'm not totally sure.
Any thoughts @rbren ?
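For discussion, here is a small model of the translator-layer option: the format-specific parser output stays raw, and a wrapping translator fills in the image after the default translation has run. Every type here (`rawFeed`, `unifiedFeed`, `translator`, `imageTranslator`, and so on) is a simplified local stand-in written for this sketch, not gofeed's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// rawItem / rawFeed stand in for the format-specific parse result.
type rawItem struct {
	Title       string
	MediaURL    string // url attribute of a media:content element, if any
	MediaMedium string // its medium attribute, e.g. "image"
}

type rawFeed struct{ Items []rawItem }

// unifiedItem / unifiedFeed stand in for the unified feed objects.
type unifiedItem struct{ Title, Image string }

type unifiedFeed struct{ Items []unifiedItem }

// translator converts a raw, format-specific feed into the unified form.
type translator interface {
	Translate(raw interface{}) (*unifiedFeed, error)
}

// defaultTranslator copies only spec-defined fields and never sets Image.
type defaultTranslator struct{}

func (defaultTranslator) Translate(raw interface{}) (*unifiedFeed, error) {
	rf, ok := raw.(*rawFeed)
	if !ok {
		return nil, errors.New("expected *rawFeed")
	}
	out := &unifiedFeed{}
	for _, it := range rf.Items {
		out.Items = append(out.Items, unifiedItem{Title: it.Title})
	}
	return out, nil
}

// imageTranslator wraps another translator and adds best-effort images
// afterwards, leaving the raw parser and its output untouched.
type imageTranslator struct{ base translator }

func (t imageTranslator) Translate(raw interface{}) (*unifiedFeed, error) {
	rf, ok := raw.(*rawFeed)
	if !ok {
		return nil, errors.New("expected *rawFeed")
	}
	out, err := t.base.Translate(raw)
	if err != nil {
		return nil, err
	}
	for i := range out.Items {
		if i < len(rf.Items) && out.Items[i].Image == "" && rf.Items[i].MediaMedium == "image" {
			out.Items[i].Image = rf.Items[i].MediaURL
		}
	}
	return out, nil
}

func main() {
	raw := &rawFeed{Items: []rawItem{
		{Title: "a", MediaURL: "https://example.com/a.jpg", MediaMedium: "image"},
		{Title: "b"},
	}}
	tr := imageTranslator{base: defaultTranslator{}}
	feed, err := tr.Translate(raw)
	if err != nil {
		fmt.Println("translate error:", err)
		return
	}
	for _, it := range feed.Items {
		fmt.Printf("%s -> %q\n", it.Title, it.Image)
	}
}
```

Keeping the heuristics in a wrapper like this means anyone using the raw RSS or Atom parser directly still sees exactly what the feed contains, which matches the leaning described above.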