gofeed
Dealing with invalid utf-8
Expected behavior
Filter out invalid UTF-8 characters automatically, or allow opting in to this behavior.
Actual behavior
The error "XML syntax error on line 93: invalid UTF-8" is produced and the feed cannot be processed.
Steps to reproduce the behavior
It seems to happen only if I fetch the feed from https://ain.ua/feed using f.ParseURL. When I open a locally saved file with f.Parse, it works.
You can fetch the feed with your own code instead of f.ParseURL, and add this header to your request:
req.Header.Set("Accept-Charset", "utf-8")
Then read the response through this auto-detecting decoder (charset here is golang.org/x/net/html/charset):
// Determine encoding and read body
reader, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
if err != nil {
	return nil, err
}
I have used the following workaround: https://github.com/kisielk/gorge/blob/master/util/util.go which strips non-utf8 chars from the stream.
Yes, this could work for any content, but it removes the bad characters rather than decoding them. Anyway, if you are satisfied with this, no problem :)
@musabgultekin yeah, I don't think the problem is that the site serves the wrong encoding. I think it's really just badly encoded UTF-8 where only some chars are broken. The content still looks good after the bad chars are removed.
I dug into this issue and found that it is caused by the encoding/xml package. The package checks whether each character is within the XML character range and, if not, returns that error.
I copied isInCharacterRange() into my code and filter all characters with it before feeding the data into gofeed.Parser.
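A rough sketch of that filter, in case it helps someone. The isInCharacterRange body below follows the character ranges of the XML 1.0 Char production, which is what the unexported encoding/xml helper checks. sanitizeXML is a hypothetical name, and dropping undecodable bytes in the same pass is my assumption about what "filter all characters" should cover, not something confirmed in this thread.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isInCharacterRange reports whether r is a character allowed in XML 1.0.
// Local copy, since the encoding/xml function is not exported.
func isInCharacterRange(r rune) bool {
	return r == 0x09 ||
		r == 0x0A ||
		r == 0x0D ||
		r >= 0x20 && r <= 0xD7FF ||
		r >= 0xE000 && r <= 0xFFFD ||
		r >= 0x10000 && r <= 0x10FFFF
}

// sanitizeXML (hypothetical helper) returns a copy of in without bytes
// that fail to decode as UTF-8 and without runes outside the XML range.
func sanitizeXML(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for len(in) > 0 {
		r, size := utf8.DecodeRune(in)
		// RuneError with size 1 means the byte could not be decoded.
		invalid := r == utf8.RuneError && size == 1
		if !invalid && isInCharacterRange(r) {
			out = append(out, in[:size]...)
		}
		in = in[size:]
	}
	return out
}

func main() {
	// \xff is not valid UTF-8; \x00 and \x0b are not allowed in XML 1.0.
	dirty := []byte("a\xffb\x00c\x0bd")
	fmt.Printf("%q\n", sanitizeXML(dirty))
}
```

The cleaned bytes can be wrapped in bytes.NewReader and handed to the parser's Parse method.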