HTML Entity Name munging in XML listings

Open · benjwadams opened this issue 1 year ago · 1 comment

ERDDAP does some bizarre name munging to HTML entities in XML listings.

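For example, in https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/ there are numerous href values in which underscores and periods are written as hexadecimal character references instead of the literal characters, as in the link to 2004JuvenileSportfishNOAA_DATA_Mean_v0_0_iso19115.xml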

Most browsers decode these transparently, but I have had issues following links in some Python libraries if these HTML entities aren't explicitly unescaped beforehand. It's also a pretty odd way to represent simple characters like periods and underscores, where the literal characters would suffice. Is there any reason why these characters shouldn't be used directly instead of being encoded as HTML entities?
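
For anyone hitting the same problem, here is a minimal sketch of a client-side workaround: decode the character references in the raw attribute value before resolving the link. The `raw_href` string is an assumption about how the value looks in the page source (lowercase hex, no zero padding), and `resolve_href` is a hypothetical helper name, not part of any library.

```python
from html import unescape
from urllib.parse import urljoin

# Listing page mentioned above.
BASE_URL = "https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/"

# Assumed raw attribute value as it might appear in the page source;
# the exact hex case and padding are a guess.
raw_href = (
    "2004JuvenileSportfishNOAA&#x5f;DATA&#x5f;Mean"
    "&#x5f;v0&#x5f;0&#x5f;iso19115&#x2e;xml"
)


def resolve_href(base: str, raw: str) -> str:
    """Decode HTML character references, then resolve against the base URL."""
    return urljoin(base, unescape(raw))


if __name__ == "__main__":
    print(resolve_href(BASE_URL, raw_href))
```

A spec-compliant HTML or XML parser performs this decoding when it reads the attribute, so the explicit `unescape` call should only be needed when the href is pulled out of the raw text, for example with a regular expression.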

benjwadams, May 22 '23 17:05

It is the attributes of HTML and XML tags that must be strongly encoded, for security reasons. The code that does this is in com/cohort/util/XML.java in the method called encodeAsHTMLAttribute. The JavaDoc for that method explains:

 * For security reasons, for text that will be used as an HTML or XML attribute, 
 * this replaces non-alphanumeric characters with HTML Entity &#xHHHH; format.
 * See HTML Attribute Encoding at
 * https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf
 * pg 188, section 25.4 
 * "Encoding Type: HTML Attribute Encoding
 * Encoding Mechanism: 
 * Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH;
 * format, including spaces. (HH = Hex Value)".
 * On the need to escape HTML attributes: http://wonko.com/post/html-escaping

Both of the links there are interesting reading.
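
To make the rule concrete, here is a rough Python sketch of the encoding described in that JavaDoc. It is not the Java code from XML.java: the function name is borrowed for illustration only, and the lowercase, unpadded hex output and the ASCII-only definition of "alphanumeric" are assumptions.

```python
from html import unescape


def encode_as_html_attribute(text: str) -> str:
    """Keep ASCII letters and digits; write everything else as &#xHH;."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            # Assumption: lowercase hex without zero padding.
            out.append(f"&#x{ord(ch):x};")
    return "".join(out)


if __name__ == "__main__":
    name = "NOAA_DATA_Mean_v0_0_iso19115.xml"
    encoded = encode_as_html_attribute(name)
    print(encoded)
    # A conforming parser (or html.unescape) recovers the original text.
    print(unescape(encoded) == name)

    # An attribute-breaking payload loses its quote and space characters.
    print(encode_as_html_attribute('x" onmouseover="alert(1)'))
```

The point of the blanket rule is visible in the last call: once quotes, spaces, and angle brackets are all encoded, the value cannot terminate the attribute it is placed in, whatever the surrounding markup looks like.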

One might argue that in some circumstances this strict encoding is not necessary. Perhaps. Perhaps not. The problem is that trying to make that determination is very time-consuming (even if we assume the programmer has 100% understanding of the situation) and error-prone. It is vastly simpler and, more importantly, vastly safer to just routinely encode all attributes in the safe and recommended way.

BobSimons, May 22 '23 20:05