ALR hashes are injected to the first found matching tag, which might not be the targeted one

Open MathieuLamiot opened this issue 1 year ago • 0 comments

Context As part of this change on LRC feature, one of the identified issue with the implementation is that a hash might be injected to the wrong opening tag. The strategy currently is to parse the HTML with DOMDocument to identify opening tags to add a hash to. Then, we get this opening tag and look for it in the actual HTML, and add the hash to the first matching occurrence. If there is exactly the same tag as the targeted one, earlier in the HTML, that did not get the hash because it is not eligible, (typically too deep), then we inject the hash to the wrong tag. Here is an example (from this fixture):

<html>
	<head>
		<title>Original</title>
	</head>
	<body>
		<main>
			<article>
				<header>
					<h1>Original</h1>
					<div></div>
				</header>
				<section>
					<h2>Text</h2>
					<p>Original content</p>
				</section>
			</article>
		</main>
		<footer>
			<div>
				<p>Original footer</p>
			</div>
		</footer>
	</body>
</html>

The first

is not eligible to a hash because it is deeper than level 3 compared to body. The second

is eligible. So we look for

in the HTML here and end up replacing the first

instead of the second one.

Expected behavior The LRC hash must be injected to the element identified by DOMDocument, especially when there are multiple times the same opening tag in the HTML, and that the first one is not eligible because it is too deep.

Acceptance Criteria

Remove attribute-added-to-bypass-dom-processor-known-issue in all tests and fixtures. Tests should pass.
The issue should not be reproducible on the identified templates:
- https://wp-media.slack.com/archives/CUT7FLHF1/p1725892450082049
- https://wp-media.slack.com/archives/CUT7FLHF1/p1725624439071799

Additional information I conducted some experiment a few months ago on a similar issue, and it could be useful here. See this code. The idea here was to keep track of the position of the last opening tag processed and to to the search&replace starting from this position, so that we don't process what is earlier in the file. This is done with the PREG_OFFSET_CAPTURE option of preg_match_all, which we could try to use here. On the example above, we would inject the

tag, then look for

starting from the

tag, hence we won't be picking the first div, but the correct one only. The only limitation here is on a case like the following, where there is nothing to process between the wrong and the targeted opening tag. So this solution is not perfect, but acceptable if we don't have other ideas.

<html>
	<head>
		<title>Original</title>
	</head>
	<body>
		<main>
			<article>
				<header>
					<h1>Original</h1>
					<div></div>
				</header>
			</article>
		</main>
		<div>
			<p>Original footer</p>
		</div>
	</body>
</html>

Sep 10 '24 15:09 MathieuLamiot