XML.jl
XML character references are not unescaped/escaped
XML character entity references, e.g. &Aring; ("Å"), and XML numeric character references, e.g. &#xC5; ("Å"), are not unescaped/escaped by the XML.unescape and XML.escape methods.
Something like the following may help (for unescaping hexadecimal numeric character references):
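function unescape_unicode(s::AbstractString)
    # Replace every hexadecimal numeric character reference (e.g. &#xC5;) in a
    # single pass; only hex digits are accepted, not arbitrary word characters.
    return replace(s, r"&#x[0-9A-Fa-f]{1,6};" => function (ref)
        code = parse(UInt32, chop(ref; head = 3, tail = 1); base = 16)
        # Leave references that are not valid Unicode scalar values untouched.
        valid = code <= 0x10FFFF && !(0xD800 <= code <= 0xDFFF)
        return valid ? string(Char(code)) : ref
    end)
end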
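Numeric character references can also be written in decimal (e.g. &#197;), so a variant that handles both forms might look roughly like this. This is only a sketch: `unescape_numeric_refs` is a hypothetical name, not something XML.jl provides, and it assumes that references which do not name a valid Unicode scalar value should be left as they are.

```julia
# Hypothetical helper (not part of XML.jl): decode decimal (&#197;) and
# hexadecimal (&#xC5;) numeric character references in one pass.
function unescape_numeric_refs(s::AbstractString)
    pattern = r"&#(?:x[0-9A-Fa-f]+|[0-9]+);"
    return replace(s, pattern => function (ref)
        body = chop(ref; head = 2, tail = 1)  # drop the leading "&#" and trailing ";"
        if startswith(body, "x")
            code = tryparse(UInt32, chop(body; head = 1, tail = 0); base = 16)
        else
            code = tryparse(UInt32, body; base = 10)
        end
        # Assumption: keep the reference as-is when the number overflows or is
        # not a valid Unicode scalar value (e.g. a surrogate).
        code === nothing && return ref
        (code > 0x10FFFF || 0xD800 <= code <= 0xDFFF) && return ref
        return string(Char(code))
    end)
end
```

For example, `unescape_numeric_refs("&#xC5;ngstr&#246;m")` should give "Ångström", while named references such as &amp; pass through unchanged.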
Hmm, these entities need to be defined in the DTD, correct? I think we'd need (un)escape methods that take in an XML.DTDBody as well as the string.
Ah, yes, that's right. My ancient memory of XML, and of HTML in particular, led me to believe that these entities were built into XML as well, but I see now that XML only defines five entities, and the HTML-like entities are mostly or solely defined for HTML: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Standard_public_entity_sets_for_characters
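To make the distinction concrete, here is a rough sketch of unescaping only the five predefined XML entities, with an optional table for anything declared elsewhere. The names `PREDEFINED_ENTITIES` and `unescape_named_entities` are hypothetical, and the `extra` dictionary merely stands in for entity declarations that would come from a DTD; none of this is XML.jl's actual API.

```julia
# The five entities predefined by the XML specification.
const PREDEFINED_ENTITIES = Dict(
    "lt" => "<", "gt" => ">", "amp" => "&", "apos" => "'", "quot" => "\"",
)

# Hypothetical helper (not part of XML.jl). `extra` is an assumed stand-in for
# entities declared in a DTD, e.g. Dict("Aring" => "Å").
function unescape_named_entities(s::AbstractString,
                                 extra::AbstractDict = Dict{String,String}())
    table = merge(PREDEFINED_ENTITIES, extra)
    # Simplification: only ASCII entity names are recognised here.
    return replace(s, r"&[A-Za-z_][A-Za-z0-9_.\-]*;" => function (ref)
        name = String(chop(ref; head = 1, tail = 1))  # strip "&" and ";"
        # Unknown entities are left untouched rather than raising an error.
        return get(table, name, ref)
    end)
end
```

Because `replace` works in a single pass, "&amp;lt;" becomes "&lt;" rather than being unescaped twice, and &Aring; is only resolved when the caller supplies a definition for it.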
Perhaps one could just have some html-convenience escape methods...
Or perhaps provide something convenient for getting common DTDs like http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd