XML.jl

XML character references are not unescaped/escaped

Open stemann opened this issue 2 years ago • 4 comments

XML character entity references, e.g. &Aring; ("Å"), and XML numeric character references, e.g. &#xC5; ("Å"), are not unescaped/escaped by the XML.unescape and XML.escape methods.

stemann avatar Aug 29 '23 08:08 stemann

Something like the following may help (for unescaping hexadecimal numeric character references):

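function unescape_unicode(s::AbstractString)
    # Match hexadecimal numeric character references such as "&#xC5;".
    # Only hex digits are accepted; 1 to 6 digits cover all of Unicode.
    pattern = r"&#x([0-9A-Fa-f]{1,6});"
    return replace(s, pattern => function (ref)
        # ref is the full match, e.g. "&#xC5;"; strip "&#x" and ";".
        cp = parse(UInt32, ref[4:end-1]; base = 16)
        # Leave references to invalid code points untouched.
        return isvalid(Char, cp) ? string(Char(cp)) : ref
    end)
end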
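
The same idea might extend to decimal references too. The following is only a sketch using Base Julia; the name `unescape_numeric` is hypothetical and not part of XML.jl.

```julia
# Hypothetical helper, not part of XML.jl: decode both decimal ("&#197;")
# and hexadecimal ("&#xC5;") numeric character references.
function unescape_numeric(s::AbstractString)
    pattern = r"&#(?:x[0-9A-Fa-f]{1,6}|[0-9]{1,7});"
    return replace(s, pattern => function (ref)
        body = ref[3:end-1]  # e.g. "xC5" or "197"
        cp = startswith(body, "x") ?
            parse(UInt32, body[2:end]; base = 16) :
            parse(UInt32, body; base = 10)
        # Keep the reference as-is if it does not name a valid code point.
        return isvalid(Char, cp) ? string(Char(cp)) : ref
    end)
end

unescape_numeric("&#xC5;ngstr&#246;m")  # "Ångström"
```

Named references such as `&amp;` are deliberately left alone here, since resolving them needs an entity table.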

stemann avatar Aug 29 '23 08:08 stemann

Hmm, these entities need to be defined in the DTD, correct? I think we'd need (un)escape methods that take in an XML.DTDBody as well as the string.
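
A rough sketch of what such a method could look like, assuming the entity declarations have already been collected into a plain `Dict`. The dictionary stands in for whatever a DTD would provide; `unescape_entities` and `PREDEFINED_ENTITIES` are hypothetical names, and no actual XML.jl DTD type is used.

```julia
# Hypothetical sketch: a plain Dict stands in for the entity declarations
# that would come from a DTD. The five entries below are the entities
# that XML itself predefines.
const PREDEFINED_ENTITIES = Dict(
    "amp" => "&", "lt" => "<", "gt" => ">", "apos" => "'", "quot" => "\"",
)

function unescape_entities(s::AbstractString,
                           entities::AbstractDict = PREDEFINED_ENTITIES)
    return replace(s, r"&([A-Za-z_][A-Za-z0-9._-]*);" => function (ref)
        name = String(ref[2:end-1])  # drop the leading "&" and trailing ";"
        # Unknown entities are left untouched rather than raising an error.
        return get(entities, name, ref)
    end)
end

# Entities "declared in the DTD" are merged with the predefined ones.
dtd_entities = merge(PREDEFINED_ENTITIES, Dict("Aring" => "Å"))
unescape_entities("&Aring; &lt; &unknown;", dtd_entities)  # "Å < &unknown;"
```

Because `replace` makes a single pass, a doubly escaped input such as `&amp;lt;` becomes `&lt;` and is not unescaped a second time.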

joshday avatar Aug 29 '23 14:08 joshday

Ah, yes, that's right. My ancient memory of XML, and in particular HTML, led me to believe that these entities were also built into XML, but I see now that XML only predefines five entities (&amp;, &lt;, &gt;, &apos;, &quot;), and the HTML-like named entities are defined mostly, or solely, for HTML: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Standard_public_entity_sets_for_characters

Perhaps one could just have some html-convenience escape methods...
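
For the escaping direction, one DTD-independent option might be to emit numeric character references for everything outside ASCII. This is only a sketch; `escape_non_ascii` is a hypothetical name and not part of XML.jl.

```julia
# Hypothetical helper, not part of XML.jl: escape the five predefined XML
# characters by name and every non-ASCII character as a hexadecimal
# numeric character reference, so no DTD is needed to read the result.
function escape_non_ascii(s::AbstractString)
    io = IOBuffer()
    for c in s
        if c == '&'
            print(io, "&amp;")
        elseif c == '<'
            print(io, "&lt;")
        elseif c == '>'
            print(io, "&gt;")
        elseif c == '"'
            print(io, "&quot;")
        elseif c == '\''
            print(io, "&apos;")
        elseif isascii(c)
            print(io, c)
        else
            # codepoint(c) returns the Unicode code point as a UInt32.
            print(io, "&#x", uppercase(string(codepoint(c); base = 16)), ";")
        end
    end
    return String(take!(io))
end

escape_non_ascii("Å < B")  # "&#xC5; &lt; B"
```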

stemann avatar Aug 30 '23 07:08 stemann

Or perhaps provide something convenient for getting common DTDs like http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
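
As a sketch of how entity definitions could be pulled out of DTD text once it has been obtained, the following collects simple internal entity declarations into a `Dict`. The function name `parse_entity_declarations` is hypothetical, the sample declarations are written in the style of the XHTML entity sets, and fetching the DTD itself is left out.

```julia
# Hypothetical sketch: collect simple internal entity declarations of the form
#   <!ENTITY name "value">
# from DTD text. Parameter entities (<!ENTITY % ...>), external entities
# (SYSTEM/PUBLIC) and single-quoted values are deliberately not handled.
function parse_entity_declarations(dtd::AbstractString)
    entities = Dict{String,String}()
    pattern = r"<!ENTITY\s+([A-Za-z_][A-Za-z0-9._-]*)\s+\"([^\"]*)\"\s*>"
    for m in eachmatch(pattern, dtd)
        entities[m.captures[1]] = m.captures[2]
    end
    return entities
end

dtd = """
<!ENTITY Aring  "&#197;"> <!-- latin capital letter A with ring above -->
<!ENTITY oslash "&#248;">
"""

entities = parse_entity_declarations(dtd)
entities["Aring"]  # "&#197;"
```

Note that the declared values may themselves be numeric character references, so a second unescaping step would still be needed to arrive at the final characters.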

stemann avatar Aug 30 '23 07:08 stemann