
method in WP_HTML_Tag_Processor - get tag content / text content

Open yglik opened this issue 9 months ago • 3 comments

is there a way to get a tag (dom node) content or a text content using the WP_HTML_Tag_Processor class?

let's say i have a bunch of HTML, and from that HTML i want to get a specific tag with a specific class. i want to know what the text content of that tag is, or what some of its attributes might be.

is there a way to do it using the WP_HTML_Tag_Processor class?

for example:

<div class="flex-items-center flex-justify-end d-none d-md-flex my-3">
  <button data-disable-invalid="" data-disable-with="" type="submit" data-view-component="true" class="btn-primary btn ml-2">
    Submit new issue
  </button>
</div>

i want to get the BUTTON tag with class "btn-primary" and then find out its text content, which should be "Submit new issue"

yglik avatar May 07 '24 10:05 yglik

I think that might be possible with new methods introduced in WP 6.5. See the dev note for more details - https://make.wordpress.org/core/2024/03/04/updates-to-the-html-api-in-6-5/.

Mamaduka avatar May 07 '24 13:05 Mamaduka

Yes @Mamaduka is correct, we can now do it with 6.5. This past week I happened to need to build a simple Table of Contents generator that gets all the heading tags, gives them an id if they don't have one, and also gets the inner text of each one to then create a list of links. I could have done it with regex but it was a great opportunity to learn the new HTML API stuff.

I think I've got a handle on how it works now; the biggest part was realizing I can use next_tag() and next_token() in conjunction with each other, and how to identify the tokens I wanted.

i want to know what is the text content of that tag, or what some attributes might be.

  1. To read the value of an attribute, call get_attribute( 'id' ) while the processor is pointed at the tag you want. To search for an element by class, pass a query array to next_tag(), for example array( 'tag_name' => 'BUTTON', 'class_name' => 'btn-primary' ).
  2. Getting the text content is a little more involved than a single function call, but still pretty easy with the new next_token() method. The key is that you scan through potentially multiple tokens, since your button could contain nested <strong> tags or other inline markup, and we want the inner text only.
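As a minimal sketch of point 1, applied to the button from the question: this assumes WordPress 6.5 or later is loaded so that WP_HTML_Tag_Processor is available. The markup is shortened from the original post and the variable names are only illustrative.

```php
<?php
// Assumes WordPress 6.5+ is loaded; the markup below is a shortened copy of the example in the question.
$html = '<div class="my-3"><button type="submit" data-view-component="true" class="btn-primary btn ml-2">Submit new issue</button></div>';

$processor = new WP_HTML_Tag_Processor( $html );

$type    = null;
$missing = 'not checked';

// Move to the first BUTTON element that carries the "btn-primary" class.
if ( $processor->next_tag( array( 'tag_name' => 'BUTTON', 'class_name' => 'btn-primary' ) ) ) {
	// The processor now points at the opening <button> tag, so attributes can be read directly.
	$type    = $processor->get_attribute( 'type' ); // 'submit'
	$missing = $processor->get_attribute( 'id' );   // null, because the button has no id attribute
}
```

get_attribute() returns null for an attribute that is not present, so a null check is enough to tell "missing" apart from an empty value.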

The documentation on this API and these methods is pretty informative, I would recommend reviewing it https://developer.wordpress.org/reference/classes/wp_html_tag_processor/#methods

Example

For my similar use case I basically set up two nested while loops:

  1. The first outer one loops through our tags like normal
  2. Once we find a tag we want, we use $tags->next_token() to start a new inner while loop, getting the text inside the tag and concatenating it together
  3. When we hit a token that matches the tag we started with and is a closing tag, end the inner loop and continue on to the next tag
$h_tags = array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' );

$tags = new WP_HTML_Tag_Processor( $html );

// ----> (1) start looping through tags
while ( $tags->next_tag() ) {

	// I wanted to match an array of tags, so I have an if statement inside, but if you are just
	// looking for one tag you can instead query your tag name directly in `next_tag()` in the while statement
	if ( in_array( $tags->get_tag(), $h_tags, true ) ) {

		$level = (int) str_replace( 'H', '', $tags->get_tag() );
		$id    = $tags->get_attribute( 'id' );

		// set bookmark to come back to in case we need to generate an id from inner text
		$tags->set_bookmark( 'current_heading_start' );

		$text = '';

		// ----> (2) start capturing inner text
		while ( $tags->next_token() ) {
			// we only want to get plain text, skip all other token types
			if ( '#text' === $tags->get_token_type() ) {
				$text .= $tags->get_modifiable_text();
			} elseif ( "H{$level}" === $tags->get_tag() && $tags->is_tag_closer() ) {
				// ----> (3) we got all the inner text, break our inner loop so we can go to the next tag
				$tags->set_bookmark( 'current_heading_end' );
				break;
			}
		}

		// generate a new id and insert into heading tag
		if ( ! $id ) {
			// return to starting tag and update id attribute
			$id = sanitize_title( $text );
			$tags->seek( 'current_heading_start' );
			$tags->set_attribute( 'id', $id );

			// resume and clean up bookmarks
			$tags->seek( 'current_heading_end' );
			$tags->release_bookmark( 'current_heading_start' );
			$tags->release_bookmark( 'current_heading_end' );
		}

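		// append the item to the table of contents
		// ($toc is an assumed name; the original snippet builds the item but leaves out the insertion step)
		$toc[] = array(
			'text'  => $text,
			'id'    => $id,
			'items' => array(),
		);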
	}
}
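For the button case in the original question, the same two-step idea can be wrapped in a small function. This is only a sketch under a few assumptions: WordPress 6.5+ is loaded, example_get_text_content() is a hypothetical helper name (it is not part of the HTML API), and the target element is an ordinary element whose text lives in separate text tokens (not SCRIPT, STYLE, TEXTAREA or TITLE, and not a void element).

```php
<?php
// Assumes WordPress 6.5+ is loaded. example_get_text_content() is a hypothetical helper, not a core function.
function example_get_text_content( $html, $tag_name, $class_name ) {
	$processor = new WP_HTML_Tag_Processor( $html );
	$tag_name  = strtoupper( $tag_name ); // get_tag() reports tag names in upper case.

	// Step 1: find the first element with the requested tag name and class.
	if ( ! $processor->next_tag( array( 'tag_name' => $tag_name, 'class_name' => $class_name ) ) ) {
		return null;
	}

	$text  = '';
	$depth = 1; // Tracks nested elements that share the same tag name.

	// Step 2: walk the following tokens until the matching closing tag is reached.
	while ( $depth > 0 && $processor->next_token() ) {
		if ( '#text' === $processor->get_token_type() ) {
			$text .= $processor->get_modifiable_text();
		} elseif ( $tag_name === $processor->get_tag() ) {
			$depth += $processor->is_tag_closer() ? -1 : 1;
		}
	}

	return trim( $text );
}

$html = '<div class="my-3"> <button type="submit" class="btn-primary btn ml-2"> Submit <strong>new</strong> issue </button> </div>';
$text = example_get_text_content( $html, 'button', 'btn-primary' );
```

With this markup the result should be the string Submit new issue: the text tokens on both sides of and inside the nested <strong> are concatenated, and the surrounding whitespace is trimmed.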

kurtrank avatar May 13 '24 14:05 kurtrank

Thanks for sharing a great example, @kurtrank!

Mamaduka avatar May 13 '24 15:05 Mamaduka

@kurtrank Thank you for the detailed examples, they really helped me understand the process i have to follow in order to accomplish what i want.

i also wanted to add an "id" attribute to each heading and then make some sort of table of contents. and just like you, i thought this was a great opportunity to learn about the WP HTML API

it was difficult for me to pinpoint which methods i should use to extract the text of an element (what JS calls textContent or innerText), because in all the tools i have used before, like JS, Simple HTML DOM (which i use often), or other server-side DOM parsers, the text is exposed as textContent or something of that sort. it's a bit of a paradigm shift to think of the text as a token in the HTML (which i guess is what it is underneath the abstraction layers)

and i did search through the docs of the HTML API and could not find it.

i guess it could be beneficial to add some abstraction to the HTML API to make it more like other tools, let's say a method that retrieves the textContent of a tag (or, better to say in that context, a node or DOM element). also, the documentation could benefit from a wider variety of use cases, which i suppose is always good.

i really hope this thread will help people searching for something like this in the WP HTML API in the future

thanks again

yglik avatar May 18 '24 13:05 yglik