wordpress-develop KSES: Allow Custom Data Attributes having dashes in their dataset name.

Trac ticket: Core-61501. (was Core-61052) Alternative in #6598

Allow for additional custom data attributes in wp_kses_attr_check().

In this patch the set of allowable custom data attribute names is expanded to permit those including dashes in the dataset name. The term dataset name here means the name appearing to JavaScript in a browser. This is found by stripping away the data- prefix, lowercasing, and then turning every dash-letter combination into a capital letter.

The Interactivity API relies on data attributes of the form data-wp-bind--enabled. The corresponding dataset name here is wpBind-Enabled, and is currently blocked by wp_kses_attr_check(). This patch allows that and any other data attribute through which would include those dashes in the translated name.

This patch is structured in two parts:

Ensure that Core can recognize whether a given attribute represents a custom data attribute.
Once recognizing the custom data attributes, apply a WordPress-specific set of constraints to determine whether to allow it.

The effect of this change is allowing a double-dash when previously only a single dash after the initial data- prefix was allowed. It's more complicated than this, because, as evidenced in the test suite, double dashes translate into different names based on where they appear and what follows them in the attribute name.

Code used to produce this table

<style>
td, th {
	margin-right: 1em;
}
</style>
<?php

require_once __PATH_TO_REPO__ . '/wordpress-develop/src/wp-load.php';

$variations = [
	'data-',
	'post-id',
	'data-post-id',
	'data-Post-ID',
	'data-uPPer-cAsE',
	"data-\u{2003}",
	'data-🐄-pasture',
	'data-wp-bind--enabled',
	'data-over_under',
	'data--before',
	'data-after-',
	'data-after-after--',
	'data--one--two--',
	'data---one---two---',
	'data-[wish:granted]',
];

echo '<table><tr><th>Attribute Name<th>Dataset Name<th>Before<th>After</tr>';
foreach ( $variations as $name ) {
	$_name    = $name;
	$_value   = 'some-value';
	$_whole   = "{$_name}=\"{$_value}\"";
	$_vless   = 'n';
	$_element = 'div';
	$_allowed_html = [ 'div' => [ 'data-*' => true ] ];

	$was_allowed = wp_old_kses_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
	$was_allowed = $was_allowed ? '✅' : '🚫';

	$_name    = $name;
	$_value   = 'some-value';
	$_whole   = "{$_name}=\"{$_value}\"";
	$_vless   = 'n';
	$_element = 'div';
	$_allowed_html = [ 'div' => [ 'data-*' => true ] ];

	$is_allowed = wp_kses_attr_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
	$is_allowed = $is_allowed ? '✅' : '🚫';

	echo <<<HTML
	<tr>
	<th><code>{$name}</code>
	<td style="font-family: monospace;" is-data {$name}>{$name}
	<td>{$was_allowed}
	<td>{$is_allowed}
	HTML;
}
echo '</table>';

?>
<script>
document.querySelectorAll( '[is-data]' ).forEach( node => {
	node.innerText = Object.keys( node.dataset ).join( ', ' );
} );
</script>
<?php

function wp_old_kses_check( &$name, &$value, &$whole, $vless, $element, $allowed_html ) {
	$name_low    = strtolower( $name );
	$element_low = strtolower( $element );

	if ( ! isset( $allowed_html[ $element_low ] ) ) {
		$name  = '';
		$value = '';
		$whole = '';
		return false;
	}

	$allowed_attr = $allowed_html[ $element_low ];

	if ( ! isset( $allowed_attr[ $name_low ] ) || '' === $allowed_attr[ $name_low ] ) {
		/*
		 * Allow `data-*` attributes.
		 *
		 * When specifying `$allowed_html`, the attribute name should be set as
		 * `data-*` (not to be mixed with the HTML 4.0 `data` attribute, see
		 * https://www.w3.org/TR/html40/struct/objects.html#adef-data).
		 *
		 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
		 * double hyphens `--` are not accepted by WordPress.
		 */
		if ( str_starts_with( $name_low, 'data-' ) && ! empty( $allowed_attr['data-*'] )
			&& preg_match( '/^data(?:-[a-z0-9_]+)+$/', $name_low, $match )
		) {
			/*
			 * Add the whole attribute name to the allowed attributes and set any restrictions
			 * for the `data-*` attribute values for the current element.
			 */
			$allowed_attr[ $match[0] ] = $allowed_attr['data-*'];
		} else {
			$name  = '';
			$value = '';
			$whole = '';
			return false;
		}
	}

	if ( 'style' === $name_low ) {
		$new_value = safecss_filter_attr( $value );

		if ( empty( $new_value ) ) {
			$name  = '';
			$value = '';
			$whole = '';
			return false;
		}

		$whole = str_replace( $value, $new_value, $whole );
		$value = $new_value;
	}

	if ( is_array( $allowed_attr[ $name_low ] ) ) {
		// There are some checks.
		foreach ( $allowed_attr[ $name_low ] as $currkey => $currval ) {
			if ( ! wp_kses_check_attr_val( $value, $vless, $currkey, $currval ) ) {
				$name  = '';
				$value = '';
				$whole = '';
				return false;
			}
		}
	}

	return true;
}

These are recommendations. If these naming recommendations are not followed, no errors will occur. The attributes will still be matched using CSS attribute selectors, with the attribute being case insensitive and any attribute value being case-sensitive. Attributes not conforming to these three recommendations will also still be recognized by the JavaScript HTMLElement.dataset property and user-agents will include the attribute in the DOMStringMap containing all the custom data attributes for an HTMLElement.

https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/data-*

Apr 22 '24 18:04 dmsnell

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Apr 22 '24 18:04 github-actions[bot]

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, peterwilsoncc.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Apr 24 '24 09:04 github-actions[bot]

I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and : in data attributes, but Chrome doesn't seem to care at all.

<div data-XYZ:ABC="works">ok</div>

document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeType === document.ATTRIBUTE_NODE
// true
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeName
// "data-xyz:abc"
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeValue
// "works"
document.querySelector('div').dataset['xyz:abc'];
// "works"

specs and references

A custom data attribute is an attribute in no namespace whose name starts with the string "data-", has at least one character after the hyphen, is XML-compatible, and contains no ASCII upper alphas.

(then)

Attribute names are said to be XML-compatible if they match the Name production defined in XML and they contain no U+003A COLON characters (:). [XML]

(finally):

Names and Tokens
[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name          ::= NameStartChar (NameChar)*
[6]  Names         ::= Name (#x20 Name)*
[7]  Nmtoken       ::= (NameChar)+
[8]  Nmtokens      ::= Nmtoken (#x20 Nmtoken)*

May 21 '24 16:05 sirreal

I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and : in data attributes, but Chrome doesn't seem to care at all.

this is going to be another case where it's evident that the HTML5 specification is a family of interrelated specifications for different purposes. there are different rules for what we allow producing than there are for what we demand for reading.

for example, attributes aren't allowed to be created with brackets in their name, like [if]="var.hasAccent" and yet every parser must recognize this as an attribute. so it's more or less a blur and everything that's allowed is spec, but then we "shouldn't" do things that are against the rules, but they are benign.

on data attributes what I was most concerned about is matching browser behavior. we should have a direct correspondence between the attributes on the server that appear in a dataset for an Element in the browser. if we find cases that all three browsers reject, let's add those in the test suite and note the omission.

May 21 '24 20:05 dmsnell

I've rearranged the code somewhat, doing what I originally wanted to avoid, in creating a new function. Still, the desire to add a test suite overcame my hesitation.

Now, I've added wp_kses_transform_custom_data_attribute_name, which is a mouthful, but does exactly what I believe Core needs to do. That is, it needs to start by recognizing what is and what isn't a data attribute.

Beyond this, we can talk about arbitrarily rejecting some of them, but we can add that on top of this work.

That is, if we only want to allow a subset of data attributes, we can start inside the if where we've detected one, and then reject it, knowing for sure we didn't miss any.

By the way, I ran a test in Safari, Chrome, and Firefox to confirm the test cases I added. They all operate uniformly. Surprisingly, data- (with nothing after the one -) is a valid data attribute whose name is the empty string. This test confirms the relationship between attribute name and appearing in an element's dataset property.

Test Code

<?php

$variations = [
	'data-',
	'post-id',
	'data-post-id',
	'data-Post-ID',
	"data-\u{2003}",
	'data-🐄-pasture',
	'data-wp-bind--enabled',
	'data--before',
	'data-after-',
	'data---one---two---',
	'data-[wish:granted]',
];

foreach ( $variations as $name ) {
	echo "<div><code>{$name}</code>: <span {$name}='{$name}'></span></div>\n";
}

?>
<script>
document.querySelectorAll( 'span' ).forEach( node => {
	node.innerText = JSON.stringify( node.dataset );
} );
</script>

May 22 '24 00:05 dmsnell

Interestingly, before this change data--before would have been rejected while afterwards it's allowed. The name of the dataset value doesn't have any dashes though, it's Before

May 22 '24 18:05 dmsnell

@peterwilsoncc I've refactored the test in 5294a86

Also I've updated the PR description to reflect how this PR is no longer attempting to allow all legitimate custom data attributes, but still only those WordPress special-cases, plus the ones with a dash in their dataset name.

Of note, this is slightly different than the approach in #6598. Instead of deciding to accept or reject based on the attribute name, in this patch WordPress is basing the decision off of the translated name. This subtlety means that the test is more directly related to the name of the custom data attribute.

For example, if we only test for the presence of the double dash --, we are comparing that against three different outputs: one with no dashes, one with one dash, and one with two dashes. This is evidenced in the screenshot of the table at the start of this post with the custom data attribute data--one--two-- whose name in JavaScript becomes One-Two--. Each double-dash in that name corresponds to a different translation.

Now, the rule to allow a - in the translated name implicitly allows double-dashes in the attribute name because that's the only way to produce one in the translated name. For example, wpBind-Enabled.

May 22 '24 18:05 dmsnell

@sirreal what are your thoughts on this vs. #6598? and also about @peterwilsoncc's thoughts.

I'm a little unclear as to the need for the new function as it's the attribute name that KSES is concerned about, the validation regex within the existing function should be for convertible names.

Personally I feel like I could be convinced both ways, but there's this part of me, especially after working with so many issues in WordPress' HTML handling and kses issues, that starting with the specification and then adding constraints is going to pay off over starting with a mix of special constraints and ad-hoc parsing.

That is, for example, maybe today kses rejects anything that looks like a custom data attribute, but tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?" (By the way, @peterwilsoncc, this is another phrasing of my run-around reason for including the function - to define a new semantic which can answer the question "is this a custom data attribute?")

May 23 '24 18:05 dmsnell

what are your thoughts on this vs. https://github.com/WordPress/wordpress-develop/pull/6598?

I'm happy to see this PR move forward if we're happy with it, it seems more complete. My only intention with #6598 was to have a PR that was small and easy to land. I'm happy to close that PR.

May 23 '24 19:05 sirreal

@dmsnell I can see it leading to confusing if the new function doesn't return null for a value that will end up being stripped:

wp> wp_kses_transform_custom_data_attribute_name( "data-love-a-little-🐮-love-a-little-tippin'" )
=> string(34) "loveALittle-🐮LoveALittleTippin'"
wp> wp_kses_post( "<div data-love-a-little-🐮-love-a-little-tippin'>Yes, a musical theatre reference</div>" )
=> string(43) "<div>Yes, a musical theatre reference</div>"

As it's a kses function it suggests support that isn't available. Generally KSES sub-functions aren't intended to be called directly but that can change over time, as we've seen with safecss_filter_attr()

May 23 '24 23:05 peterwilsoncc

@peterwilsoncc 🤔 many of the kses sub-functions are there just for parsing, like wp_kses_hair and wp_kses_hair_parse and wp_kses_attr_parse

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

May 24 '24 00:05 dmsnell

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

Yeah, I've got ya. We're working to the same end & it's a complex question :)

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

That works as the HTML API seems more about parsing without regard to kses. As the maintainer of the HTML API is that something you are happy with? If it's not something you had planned and don't think is a valid use of the feature then don't feel the need to wedge it in there on my account.

tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?

Are you able to clarify this a little more?

I'm still trying to get clear on the purpose of the new function in my head and how it differs from the regex validation.

May 24 '24 01:05 peterwilsoncc

thanks again @peterwilsoncc, and sorry for my late response, which is due to some transit that threw me off.

That works as the HTML API seems more about parsing without regard to kses.

I'm happy to put this in there since it relates to the spec issues. Originally we had a WP_HTML_Spec class in the HTML API proposal but we removed it. It was basically an all-static class with knowledge like this from the HTML spec, and no other stateful or complicated logic.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

Are you able to clarify this a little more?

Thank you for asking. I don't mean to be unclear, but I don't always get things out very well.

I could change the question here a bit. This entire patch and need to open up kses is complicated by the fact that it's unclear what Core wants or expects. Here's was the existing code claims…

/**
 * Allow `data-*` attributes.
 *
 * …
 * 
 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
 * double hyphens `--` are not accepted by WordPress.
 */

The existing code makes a claim for the purpose of the regex but then immediately contradicts that claim by specifying new rules. It's there to allow data-* attributes, but then also it's there to only allow some of them. It doesn't explain why these new restrictions are in place or who decided. It's evident that there are additional rules being imposed, but there's no explanation for a process to determine how and if it's safe to change those rules, or know if there's a bug in the original implementation.

For example, should Core allow data-________? It's allowed by the existing regex, but it looks wrong. What makes it wrong? Looks?

Getting back to your question, by separating the question into two steps I find it easier to disambiguate situations like these: is this a custom data attribute? is this one Core allows?

We have a clear goalpost for proper parsing, so we don't have to ask ourselves if the regex was intentionally overlooking other custom data attributes vs. whether it was accidentally overlooking them. Since this entire patch is about changing what WordPress allows, we can focus on WordPress additional rules and not conflate that with how to properly detect legitimate ones. We can change our own sanitization without ever effecting the underlying parsing code.

Anyway I hope I haven't made this more confusing. This is all about making it easier to talk about and verify that code we write is correct and about separating the act of detecting things from allowing some of those things we detect.

It's similar to using the HTML API to find IMG tags with a given class first by finding all of the legitimate IMG tags and then applying further rules, vs. attempting to find the subset directly with some pattern like ~<img src="https://[^"]+"[^>]>~ - it invites conflation of spec. vs. custom behaviors which open up opportunity to miss things.

Edit: I thought of a more concrete example I came upon that might help. In the test suite there's a case of data--before. It won't pass the old regex, but this doesn't have any dashes in it when read from element.dataset.Before. Now we can ask ourselves, was it intentional that Core rejected it, or did nobody think of this particular case? Another test case uses data--one--two--three in part to demonstrate how the dashes on the HTML attribute don't have a 1:1 correspondence with the translated name on the other side (it translate into One-Two--, where each double hyphen corresponds to zero, one, and two hyphens in JavaScript).

May 30 '24 04:05 dmsnell

Thanks for clarifying.

I've been considering this over the weekend and my strong inclination is to go with the source code changes in PR #6598 to allow for the lightest touch changes to land in core.

Doing some quick testing, I can't see it leading to problems but it's worth coming up with additional edge cases. (Also doing quick tests, I can see that some of the browser interpretations are interesting https://jsbin.com/devefix/edit?html,js,console )

That said, if you can think of problematic cases it would lead to, I am happy to listen. It's best to know about these things pre- rather than post-commit.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

If we're to validate the dataset, I'd be inclined to inline it for the time being if the intent is to deprecate in the shortish term.

In the past Core has removed private delegations because it didn't reflect reality (List Tables being the poster-child for this) so we'd need to keep it around. Inlining also allows a more considered approach to inclusion in the HTML API while you figure out where best to place it.

Jun 03 '24 00:06 peterwilsoncc

If you feel happy with #6598 then let's merge that in and leave this open for further consideration. The two aren't contradictory, but this one encompasses more choice. I'll approve that one, and if nobody has merged it by later today I'll try and do that. The most important thing to me is that we allow these where WordPress currently removes them for the Interactivity API's purposes.

In aa63e2b0756f21043b74a8cc1fe45bc4ceddf552 I added your additional test cases. Cases like these are exactly why I approached the problem the way I did, because intuition of the mapping from HTML to JavaScript can be fuzzy, meaning the rules we apply in PHP are in high risk for being different than we think they are unless we write the rules to the output of that transformation.

Thanks @peterwilsoncc!

Jun 03 '24 07:06 dmsnell

We should update the trac ticket for this to https://core.trac.wordpress.org/ticket/61501. This would be an enhancement targeting the 6.7 release.

Jun 25 '24 16:06 sirreal