Make illegal character sanitization more robust

Open mmcdole opened this issue 2 years ago • 2 comments

Following issues #180, #25, and a few others, I'd like to make character sanitization more robust.

I've previously tried to have the code do something like the following:

func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer

	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			// Replace illegal characters with their XML character reference.
			// Caveat: XML 1.0 also forbids references to illegal characters, so a
			// strict parser can still reject them. Comment out the next line to drop them instead.
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String()
}

func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}
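
A caveat on the character-reference branch above: XML 1.0 requires that a character reference also resolve to a legal character, so writing `&#x8;` in place of a raw U+0008 may not satisfy the parser. The sketch below checks this against Go's `encoding/xml`; the `parses` helper and the `<t>` element are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// parses reports whether encoding/xml accepts doc as a single element
// with character data. It is a hypothetical helper for this sketch.
func parses(doc string) bool {
	var v struct {
		Text string `xml:",chardata"`
	}
	return xml.Unmarshal([]byte(doc), &v) == nil
}

func main() {
	// Illegal character written as a character reference.
	fmt.Println(parses("<t>a&#x8;b</t>"))
	// Illegal character dropped entirely.
	fmt.Println(parses("<t>ab</t>"))
}
```

If the first call reports a failure, dropping the character (or substituting something like U+FFFD) is likely the safer branch than emitting a reference.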

However, an old issue (#21) indicated that sanitizing these characters broke the parsing of non-UTF-8 feeds.
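
One plausible mechanism for that breakage (an assumption, since #21 is not quoted here): ranging over a string decodes it as UTF-8, and every byte that is not valid UTF-8 comes back as U+FFFD. U+FFFD passes `isLegalXMLChar`, so the loop silently rewrites the original bytes before the XML decoder ever sees the declared encoding. A minimal sketch, where `roundTrip` is a hypothetical stand-in for the sanitize loop with the legality check removed:

```go
package main

import (
	"fmt"
	"strings"
)

// roundTrip copies s rune by rune, the same way the sanitize loop does.
// Bytes that are not valid UTF-8 are decoded as U+FFFD by the range loop.
func roundTrip(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	latin1 := "caf\xe9" // "café" in ISO-8859-1; 0xE9 alone is not valid UTF-8
	utf8Text := "café"  // the same text in UTF-8

	fmt.Printf("%q -> %q\n", latin1, roundTrip(latin1))
	fmt.Printf("%q -> %q\n", utf8Text, roundTrip(utf8Text))
}
```

The UTF-8 input survives unchanged, while the ISO-8859-1 byte is replaced, which is consistent with sanitization only being safe after the feed has been converted to UTF-8.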

If anyone has suggestions for how to accommodate both of the following requirements, it would be much appreciated:

  • Stripping illegal characters from feeds to prevent the XML parser from throwing an error
  • Allowing the parsing of non-UTF-8 feeds

mmcdole avatar Mar 25 '23 16:03 mmcdole

I'm guessing I need to handle this in two steps:

  1. Convert non-UTF-8 feeds to UTF-8
  2. Sanitize the feed afterwards

I could do something like:

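import (
	"bytes"
	"fmt"
	"io"

	"golang.org/x/net/html/charset" // assumed source of charset.NewReader
)

func convertToUTF8(data []byte) (string, error) {
	// An empty content type leaves detection to the BOM and content sniffing.
	reader, err := charset.NewReader(bytes.NewReader(data), "")
	if err != nil {
		return "", err
	}
	utf8Data, err := io.ReadAll(reader)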
	if err != nil {
		return "", err
	}
	return string(utf8Data), nil
}

func sanitizeXML(xmlData []byte) (string, error) {
	utf8Data, err := convertToUTF8(xmlData)
	if err != nil {
		utf8Data = string(xmlData) // Fallback to original data if conversion fails
	}

	var buffer bytes.Buffer

	for _, r := range utf8Data {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}

	return buffer.String(), nil
}

I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if charset.NewReader failed to detect the encoding.
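
One possible fallback when the encoding cannot be detected is to sanitize at the byte level instead of the rune level. The sketch below is a suggestion, not something the thread settled on, and `stripIllegalControlBytes` is a hypothetical name. It assumes an ASCII-compatible encoding in which bytes below 0x20 never occur inside a multi-byte character (true for UTF-8, ISO-8859-x and Windows-125x; not true for UTF-16, UTF-32 or escape-based encodings such as ISO-2022-JP). It also only covers the C0 control range, not U+FFFE, U+FFFF or surrogates.

```go
package main

import "fmt"

// stripIllegalControlBytes drops the control bytes that XML 1.0 forbids
// (everything below 0x20 except tab, LF and CR) and leaves all other
// bytes untouched, so the original encoding is preserved.
func stripIllegalControlBytes(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for _, b := range data {
		if b < 0x20 && b != '\t' && b != '\n' && b != '\r' {
			continue // illegal in XML 1.0 character data
		}
		out = append(out, b)
	}
	return out
}

func main() {
	// ISO-8859-1 input containing a NUL and a backspace.
	raw := []byte("<title>caf\xe9\x00\x08</title>")
	fmt.Printf("%q\n", stripIllegalControlBytes(raw))
}
```

Because the high byte 0xE9 passes through unmodified, the XML decoder can still apply whatever encoding the feed declares.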

mmcdole avatar Mar 25 '23 16:03 mmcdole