htmldiff icon indicating copy to clipboard operation
htmldiff copied to clipboard

Invalid HTML output in 1.0.0 (pre-release) of diffed HTML content

Open silent-e opened this issue 9 months ago • 1 comments

Apologies for the length but I wanted to be thorough.

First, it's SOOO great this is getting updated! Thank you for all the work. I inherited an app that was diffing by using a file copied from v0.0.1 at this commit and I started looking for where the original dev got their code since they didn't document it and wasn't using it as a gem. This issue is based on the latest master branch version, commit cc4f7f.

Here's a simple example that is causing me some problems. Use this sample old and new text to diff.

old

<p>Lorem ipsum dolor sit amet. foo</p>

new

<p>Lorem ipsum dolor sit amet.</p>
<p>New paragraph</p>
<p>And yet another new paragraph</p>

It's IMHO an example that could be quite common. New paragraphs added, and a small text change.

Issue one is how they're getting diffed. Neither recognize that they should identify each paragraph separately, but instead see the opening tag of the first paragraph matching with the closing tag of the last paragraph. The 1.0.0 tokenizer output is correct.

Issue two is the invalid HTML returned from 1.0.0. It's how the <ins> tags are written into the diff output. Here's how the diff output differs (heh) between 0.0.1 and 1.0.0. I'm using the compatibility html_format setting in UPGRADE.md for the 1.0.0 output.

0.0.1 - valid html

<p>Lorem ipsum dolor sit amet.</p><ins class="diffins">
</ins><p><ins class="diffins">New</ins> <del class="diffmod">foo</del><ins class="diffmod">paragraph</ins></p><ins class="diffins">
</ins><p><ins class="diffins">And yet another new paragraph</ins></p><ins class="diffins">
</ins>

1.0.0 - invalid html

<p>Lorem ipsum dolor sit amet.<del class="diffmod"> foo</del><ins class="diffmod"></p>
<p>New paragraph</ins></p><ins class="diffins">
<p>And yet another new paragraph</p>
</ins>

The 1.0.0 returned diff is invalid HTML. It got <del> content correct but the first set of <ins> tags are wrong because of how it interpreted the paragraphs. The second <ins> tag is correct but what I think the final diff should look like is this ...

<p>Lorem ipsum dolor sit amet.<del class="diffmod"> foo</del></p>
<ins class="diffins">
<p>New paragraph</p>
<p>And yet another new paragraph</p>
</ins>

With the way the 1.0.0 diff is written the <ins> tag is trying to span from inside a block-level element, then close inside a second block-level element, inside the second paragraph. The browser interprets the 1.0.0 output the best it can by trying to close open tags and removing invalid closing tags. What ultimately gets rendered is shown here using screenshots from Safari's web inspector.

0.0.1

browser rendered source visual output with styling
Image Image

1.0.0

browser rendered source visual output with styling
Image Image.

Version 0.0.1 didn't diff correctly either but it did return valid HTML.

The W3C html validator reports this for the 1.0.0 output ...

Image

Thoughts? I understand that it's meant more for diffing text and not html but IMHO it's the only gem I can find that gets close to getting it right.

(e)

silent-e avatar Mar 13 '25 21:03 silent-e

AWESOME 👍 thank you so much for reporting. I think the correct solution will require adding Nokogiri to the library, so that we can walk the DOM tree structure.

Please also note that version 1.0.0 will remove the "diffmod", etc. classes by default (according to my latest changes). You can add it back with HTMLDiff.diff(a, b, html_format: { class_delete: "diffdel", class_insert: "diffins", class_replace: "diffmod" })

johnnyshields avatar Mar 14 '25 01:03 johnnyshields

Thank you!

I think I met similar issues - the diffs are messing up HTML structure.

Example oldText: <div><div><span>abc</span></div></div> newText: <div><span>abc</span></div>

Generated diff:

<div><<del>div><</del>span>abc</span<del>></div</del>></div>

One of the unit tests is actually outputting invalid HTML in a similar way

      context 'when diffing escaped HTML tags' do
        let(:old_text) { '&lt;div class="old"&gt;content&lt;/div&gt;' }
        let(:new_text) { '&lt;div class="new"&gt;content&lt;/div&gt;' }

        it 'diffs escaped HTML tags correctly' do
          expect(result).to eq('&lt;div class="<del class="mod">old</del><ins class="mod">new</ins>"&gt;content&lt;/div&gt;')
        end
      end

Sumu-Ning avatar Apr 25 '25 21:04 Sumu-Ning