lucene icon indicating copy to clipboard operation
lucene copied to clipboard

LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Open elliotzlin opened this issue 2 years ago • 8 comments

Description

"<" and ">" characters are acceptable in attribute values per the HTML5 spec. See #11556 for more details.

elliotzlin avatar Aug 28 '22 06:08 elliotzlin

This change looks good to me! I just want to clone the branch and try to figure how you removed 25,000 lines of code from the generated jflex :)

rmuir avatar Aug 29 '22 19:08 rmuir

What?! How?! :) Seems like the automaton got a whole lot smaller:

   private static int [] zzUnpackAttribute() {
-    int [] result = new int[14794];
+    int [] result = new int[3082]; 

dweiss avatar Aug 30 '22 06:08 dweiss

I hadn't even noticed that! I hope it's not some JFlex side-effect that I unwittingly triggered with my change.

elliotzlin avatar Aug 30 '22 18:08 elliotzlin

I don't think you did anything wrong. The changes look fine, it is just a real puzzler.

I even cloned your repo and inspected the enormous diff to the generated code and I only see smaller DFAs, not any "logical changes". The test suite does pass.

rmuir avatar Aug 30 '22 21:08 rmuir

@rmuir @dweiss I wanted to follow up on this. Is there anything I should work on here or is this good to go?

elliotzlin avatar Sep 13 '22 05:09 elliotzlin

Sorry this slipped, @elliotzlin . I think it's fine. Please add a lucene/CHANGES.txt entry (it'll go towards 9.5.0). I'll handle the commit and backport.

dweiss avatar Sep 13 '22 06:09 dweiss

@dweiss can you help run the workflows again?

elliotzlin avatar Sep 26 '22 00:09 elliotzlin

@rmuir would you be able to help with running the workflows?

elliotzlin avatar Sep 27 '22 17:09 elliotzlin

@dweiss @rmuir bumping this old thread again.

elliotzlin avatar Nov 22 '22 06:11 elliotzlin

@dweiss @rmuir apologies for reviving this ancient thread, but I was wondering if one of you can help run the PR checks again?

elliotzlin avatar Sep 23 '23 19:09 elliotzlin

I will take a look tomorrow - was out of office last week.

dweiss avatar Oct 01 '23 20:10 dweiss

I believe I've found a regression with this bugfix. A test case that exposes:

  public void testForIssue10520Regression() throws IOException {
    String test =
        "<!DOCTYPE html><html lang=\"en\"><head><title>Test</title></head><a href=\"https://www.somewhere.com?data=\">a link</a> some text <a href=\"https://www.elsewhere.com\">another link</a></html>";
    Reader reader = new StringReader(test);
    HTMLStripCharFilter filter = new HTMLStripCharFilter(reader);
    StringWriter result = new StringWriter();
    filter.transferTo(result);
    assertEquals("Test\n\na link some text another link", result.toString().trim());
  }

The problem is with the empty data= parameter at the end of the first url. We see a few of those in our document set and that is how I noticed.

mjustice3 avatar Mar 04 '24 21:03 mjustice3