html-purify Fix tag balancing

Fix tag balancing

Open adon-at-work opened this issue 9 years ago • 15 comments

Changes

fix: optional tags are ignored during balancing, i.e., well-balanced input will not be broken by attempting to close optional tags.
tag balancing logics simplified (accept closing tag as long as it exists in the stack)
ignore self-closing solidus, which is required only for foreign elements
added a cap to limit the max levels of unclosed/nested tags

before the fix default x 0.64 ops/sec ±5.90% (6 runs sampled) disabled tag balancing x 0.60 ops/sec ±3.82% (6 runs sampled) Fastest is [ 'default' ] Speed/Time is 0.6423016849725829MB/s 1.5567777936666667s

after the fix default x 0.72 ops/sec ±5.01% (6 runs sampled) disabled tag balancing x 0.60 ops/sec ±3.91% (6 runs sampled) Fastest is [ 'default' ] Speed/Time is 0.7190234217142741MB/s 1.3906654078333334s

it actually runs faster than without it using the given 1m file. so tag balancing doesn't seem to incur any significant performance hit.

Aug 17 '15 09:08 adon-at-work

@maditya, please check. thanks :)

Aug 17 '15 10:08 adon-at-work

can we track updates to benchmark suite in a separate PR ?

Aug 17 '15 20:08 maditya

ok. rebased to separate it to PR https://github.com/yahoo/html-purify/pull/25

Aug 18 '15 03:08 adon-at-work

@maditya , as discussed over phone, i optimized the code a bit, and imposed a cap on stack size.

Aug 18 '15 10:08 adon-at-work

@maditya, @adon-at-work

Please see http://jsperf.com/array-push-int-and-string for performance with various array push technique. Since we are having a whitelist of tags, I think we would prefer using integer (enum based) assignment to the tag list stack (array).

There may be other optimization techniques, please feel free to edit the jsperf test for such.

Sep 03 '15 19:09 yukinying

@yukinying I see you're suggesting to deal with integers instead of tagName (string), and thanks for the detailed performance tests. :)

Right now, the tagName whitelist is like ['a', 'div', ...], and I'm aware of making use of the array indices as the integers we want.

@yukinying @maditya Looking farther, it boils down to whether we'll support tag-specific attribute whitelist, e.g., href is whitelisted for <a> but not <div>, as in the current Mail or YIV use case. This will for sure affect the list structure. As a reference, secure-handlebars attempted to support similar tag-specific attribute matching by: https://github.com/yahoo/secure-handlebars/blob/master/src/parser-utils.js#L68-L70. Here, we may need a very different structure.

Hence, should we now keep the current implementation as is before settling the use case regarding tag-specific attribute whitelist and its required the list structure? And perhaps, (if the list structure really requires a change,) let's do it in separate PRs.

Sep 10 '15 01:09 adon-at-work

The suggestion there is to avoid using Array.prototype.push as it is not optimized for the use case. Instead, keeping a counter of the stack size and use it to mimic the push and pop operation will give you a huge performance boost.

A further boost can be possible if we use enum based approach for identifying the tag name, as the space required for building the stack will be much smaller.

Please take the suggestion on avoiding use of array push as a basic requirement for the tag balancing stack. Once that's done, we should be able to relax the requirement of stack size limitation (e.g. grow it to 100,000) to some value that we would consider the input as totally malicious / invalid, and then just return error for the case.

Sep 11 '15 07:09 yukinying

I appreciate your suggestions, and indeed, thank you for a midnight comment. :+1:

But there has no more .push since 24 days ago. Such an optimization is completed in https://github.com/yahoo/html-purify/commit/fa24fa2ee3b184696e2f9f36a8a04cffc85dae31#diff-9c568acf92dc1a69fdd058c41195c719R121. You may like to confirm by searching for .push after clicking "Files Changed".

I personally think legitimate HTML won't nest more than a maximum of 100 levels. 100,000 sounds way too relax perhaps. Will you consider/accept 1,000 or 10,000?

Regarding "enum based approach for identifying the tag name", I also see its potential as you do. But could we not to do that in this PR before we finalize the enum/whitelist structure?

Sep 11 '15 07:09 adon-at-work

I would rather use some measurable metrics to determine the level. Could you provide a simple benchmark with nested tags with different levels, e.g. 100, 1000, 10000?

We will also need a doc (or MD) file that describes the actual tag balancing algorithm. The doc you sent me previously has too many open questions there. My preference is to have a document that clearly explains the algorithm, and we use the algorithm spec to validate if any future pull requests or bug fixes from any contributors would align with the design principles and security consideration of such.

It'd actually be much nicer if the commit https://github.com/yahoo/html-purify/commit/fa24fa2ee3b184696e2f9f36a8a04cffc85dae31 has come along with a description of the changes.

Sep 11 '15 08:09 yukinying

Looking farther, it boils down to whether we'll support tag-specific attribute whitelist, e.g., href is whitelisted for but not
, as in the current Mail or YIV use case. This will for sure affect the list structure. As a reference, secure-handlebars attempted to support similar tag-specific attribute matching by: https://github.com/yahoo/secure-handlebars/blob/master/src/parser-utils.js#L68-L70. Here, we may need a very different structure.
Hence, should we now keep the current implementation as is before settling the use case regarding tag-specific attribute whitelist and its required the list structure? And perhaps, (if the list structure really requires a change,) let's do it in separate PRs.

Agree that the matrix of allowable tags and attributes combination should come as a separate PR.

Sep 11 '15 08:09 yukinying

I would rather use some measurable metrics to determine the level. Could you provide a simple benchmark with nested tags with different levels, e.g. 100, 1000, 10000?

I'm not sure if I misunderstood.

I perceive that the cap is more related to what the norm is, for which no legit inputs will get truncated because of exceeding the cap. As long as the cap is not hit, 100 and 100,000 won't make a difference in terms of performance (stackPtr < 100 vs. stackPtr < 100000).

In fact, I'm confused about the relationship between the cap and anything related to benchmark, including the worst-case performance you sort of mentioned a while ago. Assume we've a 1m file. Say, we encountered 300KB (e.g., <b> x 100) of tags that are so nested to hit the cap, then the remaining 9700KB of data will be ignored. One can obviously tell the performance is even better once hitting the cap, as no further inputs will be consumed.

Determining that big enough number may sound magical, but more accurately it's very use-case specific. The best effort is to set a sufficiently big number. 100, 1000, 10000 all look reasonable to me. The best approach is to have it configurable as already did, and let users judge based on their needs. May be Mail never want to truncate people's emails to avoid complaints, then the number could be MAX of what a Number can hold.

Sep 11 '15 09:09 adon-at-work

We will also need a doc (or MD) file that describes the actual tag balancing algorithm. The doc you sent me previously has too many open questions there. My preference is to have a document that clearly explains the algorithm, and we use the algorithm spec to validate if any future pull requests or bug fixes from any contributors would align with the design principles and security consideration of such.

MD (or even comments inside the code) is good, since it can be included in this open-sourced project. The description on tag balancing in already included in the README.md. I'd very much appreciate if you could directly edit/improve/polish it over there. Thanks.

Sep 11 '15 09:09 adon-at-work

The description on tag balancing in already included in the README.md. I'd very much appreciate if you could directly edit/improve/polish it over there. Thanks.

We need more than a paragraph, and we need the algorithm rather than an overview. Without such, this pull request would not be accepted.

Sep 11 '15 16:09 yukinying

MD (or even comments inside the code) is good

Also, we would not accept source code comment as a proper documentation unless they are gathered in one big comment block. The reason is that it is very easy to overlook any potential new pull requests that may wipe a single line comment or moved certain code logic and mess up all the remaining comments. The requirement may not be necessary for other libraries, yet we try to set a higher bar to make this security library be more easier manageable.

Sep 11 '15 16:09 yukinying

please find the changes rebased to the latest master. comments added too. @maditya @yukinying

Oct 05 '15 09:10 adon-at-work

html-purify html-purify copied to clipboard

Fix tag balancing

html-purify
html-purify copied to clipboard