gofeed
Make illegal character sanitization more robust
Following issues #180, #25, and several others, I'd like to make character sanitization more robust.
I previously tried something like the following:
import "bytes"

// sanitizeXML drops every character that XML 1.0 does not allow.
func sanitizeXML(xmlData string) string {
	var buffer bytes.Buffer
	for _, r := range xmlData {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		}
		// Illegal characters are skipped. Writing them back as character
		// references (e.g. "&#x1;") would not help, because XML 1.0 requires
		// a referenced character to be legal as well.
	}
	return buffer.String()
}

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}
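A note on replacing illegal characters with character references: XML 1.0 requires a referenced character to match the same Char production, so a reference such as `&#x1;` should be rejected just like the raw byte. The following is a minimal sketch for checking this against Go's `encoding/xml`; the helper name `parses` is hypothetical and exists only for this check.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// parses reports whether encoding/xml reads doc to the end without an error.
// Hypothetical helper, used only to compare the inputs below.
func parses(doc string) bool {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		if _, err := d.Token(); err != nil {
			return err == io.EOF
		}
	}
}

func main() {
	fmt.Println(parses("<a>ok</a>"))    // plain text
	fmt.Println(parses("<a>\x01</a>"))  // raw control character
	fmt.Println(parses("<a>&#x1;</a>")) // the same character as a reference
}
```

If the last two inputs both fail, dropping the character (or substituting something legal such as U+FFFD) is the safer choice.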
However, an old issue, #21, showed that sanitizing these characters broke the parsing of non-UTF-8 feeds.
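One plausible explanation (an assumption, not something confirmed in #21) is that ranging over a Go string decodes it as UTF-8. Any byte that is not valid UTF-8 comes out as U+FFFD, which passes the legality check and is written to the output in place of the original byte. A small sketch with a hypothetical helper, `countInvalid`, makes the effect visible:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalid returns how many bytes of s are not part of valid UTF-8.
// Ranging over a string yields U+FFFD (utf8.RuneError) for each such byte.
func countInvalid(s string) int {
	n := 0
	for i, r := range s {
		if r == utf8.RuneError {
			// A genuine U+FFFD in the input is 3 bytes wide;
			// a decoding error consumes exactly 1 byte.
			if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
				n++
			}
		}
	}
	return n
}

func main() {
	latin1 := "caf\xe9" // "café" encoded as ISO-8859-1
	fmt.Println(countInvalid(latin1), countInvalid("café"))
}
```

Every byte counted here would be rewritten as U+FFFD by a rune-based sanitizer, so the charset conversion that happens later no longer sees the original bytes.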
If anyone has suggestions for how to accommodate both of these requirements, it would be much appreciated:
- Stripping illegal characters from feeds to prevent the XML parser from throwing an error
- Allowing the parsing of non-UTF-8 feeds
I'm guessing I need to handle this in two steps:
- Convert non-UTF-8 feeds to UTF-8 first
- Sanitize the feed afterwards
I could do something like:
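import (
	"bytes"
	"io"

	// Assumption: charset refers to golang.org/x/net/html/charset.
	"golang.org/x/net/html/charset"
)

// convertToUTF8 detects the encoding of data and decodes it to UTF-8.
func convertToUTF8(data []byte) (string, error) {
	reader, err := charset.NewReader(bytes.NewReader(data), "")
	if err != nil {
		return "", err
	}
	utf8Data, err := io.ReadAll(reader) // ioutil.ReadAll is deprecated since Go 1.16
	if err != nil {
		return "", err
	}
	return string(utf8Data), nil
}

// sanitizeXML converts xmlData to UTF-8, then drops characters illegal in XML 1.0.
func sanitizeXML(xmlData []byte) (string, error) {
	utf8Data, err := convertToUTF8(xmlData)
	if err != nil {
		utf8Data = string(xmlData) // Fall back to the original data if conversion fails
	}
	var buffer bytes.Buffer
	for _, r := range utf8Data {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		}
		// Illegal characters are dropped; a character reference to one
		// would still be rejected by an XML 1.0 parser.
	}
	return buffer.String(), nil
}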
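One thing to watch with converting up front: the XML declaration still names the original encoding, and `encoding/xml` returns an error when a non-UTF-8 encoding is declared and `Decoder.CharsetReader` is nil. If the bytes are already UTF-8 at that point, a pass-through `CharsetReader` is one possible way around it. This is a sketch only; `passThrough` and `textOf` are hypothetical names, not part of gofeed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// passThrough satisfies xml.Decoder.CharsetReader for input that has already
// been converted to UTF-8: it ignores the declared label and returns the
// reader unchanged.
func passThrough(label string, r io.Reader) (io.Reader, error) {
	return r, nil
}

// textOf decodes the first element of doc and returns its character data.
func textOf(doc string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.CharsetReader = passThrough
	var v struct {
		Text string `xml:",chardata"`
	}
	err := d.Decode(&v)
	return v.Text, err
}

func main() {
	// The bytes are UTF-8, but the declaration still names the old encoding.
	doc := `<?xml version="1.0" encoding="ISO-8859-1"?><t>café</t>`
	fmt.Println(textOf(doc))
}
```

If the decoder were instead given a real charset converter at this stage, the already converted text would be decoded a second time, so the two approaches should not be mixed.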
I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if charset.NewReader failed to detect the encoding.
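For the case where no encoding can be determined, one option is a fixed last-resort decoding. The sketch below assumes ISO-8859-1 as that fallback, purely because it needs no lookup table (every byte maps to the code point with the same value); Windows-1252 may be a better guess for real feeds. `toUTF8Fallback` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8Fallback is a last resort for input whose encoding is unknown.
// Valid UTF-8 is returned unchanged; anything else is read as ISO-8859-1,
// where each byte maps to the code point of the same value.
func toUTF8Fallback(data []byte) string {
	if utf8.Valid(data) {
		return string(data)
	}
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	fmt.Println(toUTF8Fallback([]byte("caf\xe9"))) // Latin-1 bytes
	fmt.Println(toUTF8Fallback([]byte("café")))    // already UTF-8
}
```

The result is always valid UTF-8, so a rune-based sanitizer can run on it afterwards without turning stray bytes into U+FFFD.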