lucene
lucene copied to clipboard
LUCENE-10520 / #11556 HTMLStripCharFilter bugfix
Description
"<" and ">" characters are acceptable in attribute values per the HTML5 spec. See #11556 for more details.
This change looks good to me! I just want to clone the branch and try to figure how you removed 25,000 lines of code from the generated jflex :)
What?! How?! :) Seems like the automaton got a whole lot smaller:
private static int [] zzUnpackAttribute() {
- int [] result = new int[14794];
+ int [] result = new int[3082];
I hadn't even noticed that! I hope it's not some JFlex side-effect that I unwittingly triggered with my change.
I don't think you did anything wrong. The changes look fine, it is just a real puzzler.
I even cloned your repo and inspected the enormous diff to the generated code and I only see smaller DFAs, not any "logical changes". The test suite does pass.
@rmuir @dweiss I wanted to follow up on this. Is there anything I should work on here or is this good to go?
Sorry this slipped, @elliotzlin . I think it's fine. Please add a lucene/CHANGES.txt entry (it'll go towards 9.5.0). I'll handle the commit and backport.
@dweiss can you help run the workflows again?
@rmuir would you be able to help with running the workflows?
@dweiss @rmuir bumping this old thread again.
@dweiss @rmuir apologies for reviving this ancient thread, but I was wondering if one of you can help run the PR checks again?
I will take a look tomorrow - was out of office last week.
I believe I've found a regression with this bugfix. A test case that exposes:
public void testForIssue10520Regression() throws IOException {
String test =
"<!DOCTYPE html><html lang=\"en\"><head><title>Test</title></head><a href=\"https://www.somewhere.com?data=\">a link</a> some text <a href=\"https://www.elsewhere.com\">another link</a></html>";
Reader reader = new StringReader(test);
HTMLStripCharFilter filter = new HTMLStripCharFilter(reader);
StringWriter result = new StringWriter();
filter.transferTo(result);
assertEquals("Test\n\na link some text another link", result.toString().trim());
}
The problem is with the empty data=
parameter at the end of the first url. We see a few of those in our document set and that is how I noticed.