gofeed

Dealing with invalid utf-8

Open OrKoN opened this issue 5 years ago • 5 comments

Expected behavior

Filter out invalid UTF-8 bytes automatically, or let users opt in to this behavior.

Actual behavior

The error "XML syntax error on line 93: invalid UTF-8" is produced and the feed cannot be processed.

Steps to reproduce the behavior

It seems to be happening only if I fetch the feed from https://ain.ua/feed using f.ParseURL. When I open a locally saved file with f.Parse, it works.

ain.zip

OrKoN avatar Apr 08 '19 17:04 OrKoN

You can fetch the feed with your own code instead of f.ParseURL, and add this header to your request:

req.Header.Set("Accept-Charset", "utf-8")

Then read the response through this auto-detecting decoder:

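// Determine the encoding from the Content-Type header (and the body itself)
// and wrap the body in a reader that decodes it to UTF-8.
// charset is the golang.org/x/net/html/charset package.
reader, err := charset.NewReader(resp.Body, resp.Header.Get("Content-Type"))
if err != nil {
	return nil, err
}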

musabgultekin avatar Apr 20 '19 11:04 musabgultekin

I have used the following workaround: https://github.com/kisielk/gorge/blob/master/util/util.go which strips non-utf8 chars from the stream.

OrKoN avatar Apr 20 '19 11:04 OrKoN

Yes, this could work for any content, but it removes the bad bytes rather than decoding them. Anyway, if you are satisfied with this, no problem :)

musabgultekin avatar Apr 20 '19 12:04 musabgultekin

@musabgultekin yeah, I don't think the problem is that the site serves the wrong encoding. I think it's really just badly encoded UTF-8 where only some characters are broken. The content still looks good after the bad characters are removed.

OrKoN avatar Apr 20 '19 12:04 OrKoN

I dug into this issue and found that it is caused by the encoding/xml package. The package checks whether each character is in the XML character range and, if not, raises that error.

I copied isInCharacterRange() into my code and filter all characters with it before feeding the data into gofeed.Parser.
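A rough sketch of that filtering step might look like the following. The body of isInCharacterRange mirrors the unexported helper in encoding/xml as far as I know, so check it against your Go version; filterXMLChars is a hypothetical name, and dropping undecodable bytes in the same pass is an assumption added here, not something described above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isInCharacterRange reports whether r is allowed by the XML 1.0 Char
// production. Copied in spirit from the unexported helper in encoding/xml.
func isInCharacterRange(r rune) bool {
	return r == 0x09 ||
		r == 0x0A ||
		r == 0x0D ||
		r >= 0x20 && r <= 0xD7FF ||
		r >= 0xE000 && r <= 0xFFFD ||
		r >= 0x10000 && r <= 0x10FFFF
}

// filterXMLChars (hypothetical helper) returns a copy of data without bytes
// that fail to decode as UTF-8 and without runes outside the XML range.
func filterXMLChars(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		// RuneError with size 1 means the byte is not valid UTF-8.
		invalid := r == utf8.RuneError && size == 1
		if !invalid && isInCharacterRange(r) {
			out = append(out, data[:size]...)
		}
		data = data[size:]
	}
	return out
}

func main() {
	// Contains a NUL (valid UTF-8, but not a legal XML character) and a
	// stray \xff byte (not valid UTF-8 at all).
	raw := []byte("<title>a\x00b\xffc</title>")
	fmt.Printf("%q\n", filterXMLChars(raw))
}
```

The filtered bytes can then be wrapped in a bytes.Reader and handed to the parser.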

googollee avatar Oct 22 '20 13:10 googollee