Inconsistent Handling Of Duplicate Attributes
Relates to https://github.com/jhy/jsoup/issues/1219 and https://github.com/jhy/jsoup/issues/1234
Desired Behavior
Attribute duplicates other than the first should be ignored.
Scenario Outline: Jsoup handles duplicate attributes consistently
Given I parse the string "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\" />" as <From>
When I convert the document to <To>
Then I expect the html to contain "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" />"
Examples:
|From|To|
|HTML|HTML|
|HTML|XML|
|XML|XML|
|Jsoup Direct|HTML|
|Jsoup Direct|XML|
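The "first occurrence wins" rule described above can be sketched in isolation. This is only an illustration of the desired behavior, not jsoup's implementation: the class and method names are hypothetical, the attributes are assumed to arrive as already-tokenized name/value pairs in source order, and names are compared case-sensitively for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NOT jsoup's implementation. All names here are hypothetical.
final class FirstWinsAttributes {

    // Keeps the first value seen for each attribute name and preserves source order.
    static Map<String, String> dedupe(List<String[]> pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            // putIfAbsent leaves the existing entry alone, so later duplicates are ignored.
            result.putIfAbsent(pair[0], pair[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Attributes of the <img> tag from the scenario, in source order.
        List<String[]> parsed = List.of(
                new String[] {"src", "file.png"},
                new String[] {"name", "test"},
                new String[] {"value", "foo"},
                new String[] {"type", "hidden"},
                new String[] {"value", "bar"});
        // prints {src=file.png, name=test, value=foo, type=hidden}
        System.out.println(dedupe(parsed));
    }
}
```

A `LinkedHashMap` is used so the surviving attributes keep their original order, which matches the expected tag in the scenario.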
Observed Behavior
De-duplication only happens when the input HTML is parsed with Parser.xmlParser().
Example JUnit Test
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class ConvertTest {
    // Wrapper markup expected around the tag when a full document is serialized.
    private static final String HEAD = "<html>\n" +
            " <head></head>\n" +
            " <body>\n" +
            " ";
    private static final String TRAIL = "\n" +
            " </body>\n" +
            "</html>";

    // Desired output: the second "value" attribute is dropped.
    public static final String DESIRED_ONE_TAG = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" />";
    // Input with a duplicate "value" attribute, with and without the self-closing slash.
    public static final String INPUT = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\" />";
    public static final String INPUT_NO_SLASH = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\">";

    // XML parser, default output: asserts the de-duplicated tag.
    @Test
    void parserXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.xmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(DESIRED_ONE_TAG);
    }

    // HTML parser, default output: asserts that the duplicate attribute is retained.
    @Test
    void parserHTML() {
        String doubleTag = INPUT;
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(INPUT_NO_SLASH);
    }

    // XML parser, XML output syntax: asserts the de-duplicated tag.
    @Test
    void parserXML_toXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.xmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(DESIRED_ONE_TAG);
    }

    // HTML parser, XML output syntax: asserts that the duplicate attribute is retained.
    @Test
    void parserHTML_toXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(INPUT);
    }

    // Jsoup.parse() with XML output syntax, serialized via html().
    @Test
    void jsoupParseToXML() {
        String doubleTag = INPUT;
        final Document document = Jsoup.parse(doubleTag);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        String outputXhtml = document.html()
                .replaceAll("&nbsp;", " "); // nbsp does not exist in xhtml.
        assertThat(outputXhtml).isNotBlank().isEqualTo(HEAD + INPUT + TRAIL);
    }

    // Jsoup.parse() with XML output syntax, serialized via outerHtml().
    @Test
    void jsoupParseToXML_outerMethod() {
        String doubleTag = INPUT;
        final Document document = Jsoup.parse(doubleTag);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        String outputXhtml = document.outerHtml()
                .replaceAll("&nbsp;", " "); // nbsp does not exist in xhtml.
        assertThat(outputXhtml).isNotBlank().isEqualTo(HEAD + INPUT + TRAIL);
    }
}
My group and I will be working on a fix for this issue this semester.
Hello @alycecil, our group is interested in this problem and would like to work on this issue. Can we give it a try? - SE 2022 group haha, SUSTech
It's hard to tell from this report under which conditions you are seeing duplicate attributes. I ran the test case and it does produce failures, but only because your expected values contain duplicate attributes, whereas both the desired output and the actual output have the attributes de-duplicated, in both HTML and XML parses.
Closing as no-repro. If you have a simple example of code that outputs duplicate attributes, please reopen with that.
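For anyone putting together such an example, one possible approach is to check the serialized output for repeated attribute names directly, rather than comparing it against a hand-written expected string that may itself contain the duplicate. The following is a minimal sketch under stated assumptions: the helper class is hypothetical and not part of jsoup, it only recognizes double-quoted attribute values (as in the strings used in this report), and it compares names case-insensitively, which is an HTML assumption that does not hold for XML.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for a reproduction; NOT part of jsoup.
final class DuplicateAttributeCheck {

    // Matches name="value" pairs. Assumes double-quoted values without embedded quotes.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_:][-A-Za-z0-9_:.]*)\\s*=\\s*\"[^\"]*\"");

    // Returns the attribute names that occur more than once in a serialized tag.
    static Set<String> duplicatedNames(String serializedTag) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher matcher = ATTRIBUTE.matcher(serializedTag);
        while (matcher.find()) {
            // Case-insensitive comparison: an assumption that fits HTML, not XML.
            String name = matcher.group(1).toLowerCase(Locale.ROOT);
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String withDuplicate = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\" />";
        String deduped = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" />";
        System.out.println(duplicatedNames(withDuplicate)); // prints [value]
        System.out.println(duplicatedNames(deduped));       // prints []
    }
}
```

In a reproduction, the string passed to `duplicatedNames` would be whatever the parsed document actually serializes to, so a non-empty result would demonstrate duplicate attributes in the output itself.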