commons-text
commons-text copied to clipboard
TEXT-215: Prevent decimal numeric entities from wrongly including hexadecimal characters
Hello, This a quick bugfix on the NumericEntityUnescaper. The bug allows decimal characters entities without semi-colon and followed by a letter from A to F to be ignored by the translator. A full description of the problem is found in the ticket: https://issues.apache.org/jira/browse/TEXT-215
Hi @kinow ! Thanks for the quick review. I just added a unit test for the "semiColonOptional" option which asserts the unescaping is working for both hexadecimal and decimal entities, with and without semi-colon.
Great! I've allowed the workflow to run. Now just need to find time to review issue & code (though that's rather a small change). Thanks for the pull request @rbunel35 !
Thank you very much !
Hello ! Do you have any news to give me about this fix ? Thanks in advance :)
https://www.w3.org/TR/REC-xml/#dt-charref
Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.
https://www.w3.org/TR/REC-xml/#dt-charref
Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.
Good point. I haven't checked any specification yet, but this:
# File: test.html
<iframe src="javascript:alert(1)">
Or this:
# File: test.html
<iframe src="javascript:alert(1)">
Both trigger an alert (tested with python3 -m http.server
and visit http://localhost:8000/test.html). I think the JIRA issue mentions how browsers handle this payload, so I suspect users could expect Commons Text to translate it in a similar way (not saying that it's correct or not, and whether we should do it or not :+1: , just FWIW)
My personal opinion is that we should stick to a specific version of a specification, in this case W3C XML. If we also want to emulate what a browser does or what another language does, then that could be the job for a subclass. So maybe we should be refactoring the code? Needs other opinions...
Hi, Thank you for the your answers.
Indeed I understand that semicolon-less character entities do not form part of the specification, however as pointed by @kinow, virtually all modern browsers read them as valid entities anyway, and as a user of Commons Text, I expected it to work the same way.
Also, please note that the acceptance of numeric character entities without semi-colon precedes my contribution (via the "semiColonOptional" value in the enum). My fix only makes it work correctly with hexadecimal entities.
Hey @kinow,
I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ? You mentioned I could bump you if it took too long.
Thanks in advance !
Hey @kinow,
I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ? You mentioned I could bump you if it took too long.
Thanks in advance !
Hi @rbunel35
Thanks for the remainder. I'm starting a new job in a few months, and will have about one month with extra spare time, when I expect to be able to release Text if nobody else beats me to it.
I think we will have to move this question to a wider audience, @rbunel35 . We have two options here, I can send the email to the Commons Dev mailing list explaining what the PR is fixing, and trying to explain on the difference between browsers & specs, or if you are subscribed you can send the email over there.
The only difference, I think, is that I may forget the issue again if there are no replies over there :smile: in which case you'd have to ping me again. Up to you :+1:
Thanks!!! Bruno
Thanks for the quick answer ! I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR. Thank you very much !
Thanks for the quick answer ! I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR. Thank you very much !
Brilliant!
Every component is discussed there, that's the only downside. Use the following prefix for your subject, please: "[text] Enter your email subject here".
I have a rule in GMail to move it to another folder so that I can take a look when I'm not busy. You can ignore emails for components you are not interested (I'm about to write one email for Commons Configuration, it goes to the same mailing list, with prefix [configuration]...).
Hopefully we will find a solution for this issue and fix & release it soon. Thanks!!!