commons-text icon indicating copy to clipboard operation
commons-text copied to clipboard

TEXT-215: Prevent decimal numeric entities from wrongly including hexadecimal characters

Open rbunel35 opened this issue 2 years ago • 12 comments

Hello, This a quick bugfix on the NumericEntityUnescaper. The bug allows decimal characters entities without semi-colon and followed by a letter from A to F to be ignored by the translator. A full description of the problem is found in the ticket: https://issues.apache.org/jira/browse/TEXT-215

rbunel35 avatar Mar 25 '22 06:03 rbunel35

Hi @kinow ! Thanks for the quick review. I just added a unit test for the "semiColonOptional" option which asserts the unescaping is working for both hexadecimal and decimal entities, with and without semi-colon.

rbunel35 avatar Mar 25 '22 10:03 rbunel35

Great! I've allowed the workflow to run. Now just need to find time to review issue & code (though that's rather a small change). Thanks for the pull request @rbunel35 !

kinow avatar Mar 25 '22 10:03 kinow

Thank you very much !

rbunel35 avatar Mar 25 '22 10:03 rbunel35

Hello ! Do you have any news to give me about this fix ? Thanks in advance :)

rbunel35 avatar Apr 27 '22 08:04 rbunel35

https://www.w3.org/TR/REC-xml/#dt-charref

Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.

garydgregory avatar Apr 27 '22 11:04 garydgregory

https://www.w3.org/TR/REC-xml/#dt-charref

Why are illegal entities allowed in the first place? Am I reading the specification incorrectly? The ';' character should be required. IMO this feature creep on our end feels improper and should not be allowed or at the very least deprecated.

Good point. I haven't checked any specification yet, but this:

# File: test.html
<iframe src="&#106avascript:alert(1)">

Or this:

# File: test.html
<iframe src="&#106;avascript:alert(1)">

Both trigger an alert (tested with python3 -m http.server and visit http://localhost:8000/test.html). I think the JIRA issue mentions how browsers handle this payload, so I suspect users could expect Commons Text to translate it in a similar way (not saying that it's correct or not, and whether we should do it or not :+1: , just FWIW)

kinow avatar Apr 27 '22 11:04 kinow

My personal opinion is that we should stick to a specific version of a specification, in this case W3C XML. If we also want to emulate what a browser does or what another language does, then that could be the job for a subclass. So maybe we should be refactoring the code? Needs other opinions...

garydgregory avatar Apr 27 '22 11:04 garydgregory

Hi, Thank you for the your answers.

Indeed I understand that semicolon-less character entities do not form part of the specification, however as pointed by @kinow, virtually all modern browsers read them as valid entities anyway, and as a user of Commons Text, I expected it to work the same way.

Also, please note that the acceptance of numeric character entities without semi-colon precedes my contribution (via the "semiColonOptional" value in the enum). My fix only makes it work correctly with hexadecimal entities.

rbunel35 avatar Apr 27 '22 12:04 rbunel35

Hey @kinow,

I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ? You mentioned I could bump you if it took too long.

Thanks in advance !

rbunel35 avatar May 10 '22 05:05 rbunel35

Hey @kinow,

I hope I'm not too much impatient (sorry if that's the case ^^) but do you have news about this PR ? You mentioned I could bump you if it took too long.

Thanks in advance !

Hi @rbunel35

Thanks for the remainder. I'm starting a new job in a few months, and will have about one month with extra spare time, when I expect to be able to release Text if nobody else beats me to it.

I think we will have to move this question to a wider audience, @rbunel35 . We have two options here, I can send the email to the Commons Dev mailing list explaining what the PR is fixing, and trying to explain on the difference between browsers & specs, or if you are subscribed you can send the email over there.

The only difference, I think, is that I may forget the issue again if there are no replies over there :smile: in which case you'd have to ping me again. Up to you :+1:

Thanks!!! Bruno

kinow avatar May 10 '22 05:05 kinow

Thanks for the quick answer ! I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR. Thank you very much !

rbunel35 avatar May 10 '22 05:05 rbunel35

Thanks for the quick answer ! I'm not subscribed yet, but will do it right now, and then I'll send the email to explain my PR. Thank you very much !

Brilliant!

Every component is discussed there, that's the only downside. Use the following prefix for your subject, please: "[text] Enter your email subject here".

I have a rule in GMail to move it to another folder so that I can take a look when I'm not busy. You can ignore emails for components you are not interested (I'm about to write one email for Commons Configuration, it goes to the same mailing list, with prefix [configuration]...).

Hopefully we will find a solution for this issue and fix & release it soon. Thanks!!!

kinow avatar May 10 '22 05:05 kinow