
Inconsistent Handling Of Duplicate Attributes

Open alycecil opened this issue 3 years ago • 2 comments

Relates to https://github.com/jhy/jsoup/issues/1219 and https://github.com/jhy/jsoup/issues/1234

Desired Behavior

Attribute duplicates other than the first should be ignored.

Scenario Outline: Jsoup handles duplicate attributes consistently
Given I parse the string "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\" />" as <From>
When I convert the document to <To>
Then I expect the html to contain "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" />"
 Examples:
 |From|To|
 |HTML|HTML|
 |HTML|XML|
 |XML|XML|
 |Jsoup Direct|HTML|
 |Jsoup Direct|XML|
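
The "first occurrence wins" rule requested above can be illustrated without jsoup. The following is a minimal sketch using only the Java standard library; the class `FirstWinsAttributes` and its `dedupe` method are hypothetical names invented for this illustration and are not part of the jsoup API. It also compares attribute names exactly, whereas an HTML parser would normally treat them case-insensitively.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not how jsoup is implemented.
class FirstWinsAttributes {

    // Keeps the first value seen for each attribute name and drops later
    // duplicates, preserving the order in which names first appear.
    static Map<String, String> dedupe(List<Map.Entry<String, String>> attrs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> attr : attrs) {
            // putIfAbsent leaves an existing mapping untouched, so the first value wins.
            result.putIfAbsent(attr.getKey(), attr.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // The attributes of the <img> tag from the scenario, in source order.
        List<Map.Entry<String, String>> attrs = List.of(
                Map.entry("src", "file.png"),
                Map.entry("name", "test"),
                Map.entry("value", "foo"),
                Map.entry("type", "hidden"),
                Map.entry("value", "bar"));

        // prints {src=file.png, name=test, value=foo, type=hidden}
        System.out.println(dedupe(attrs));
    }
}
```

The second `value="bar"` is discarded and `value="foo"` is kept, which matches the expected output in the scenario.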

Observed behavior

Duplicate attributes are removed only when the input HTML is parsed with Parser.xmlParser().

Example JUnit test

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

public class ConvertTest {
    private static final String HEAD = "<html>\n" +
            " <head></head>\n" +
            " <body>\n" +
            "  ";
    private static final String TRAIL = "\n" +
            " </body>\n" +
            "</html>";
    public static final String DESIRED_ONE_TAG = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" />";
    public static final String INPUT = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\" />";
    public static final String INPUT_NO_SLASH = "<img src=\"file.png\" name=\"test\" value=\"foo\" type=\"hidden\" value=\"bar\">";

    @Test
    void parserXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.xmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");

        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(DESIRED_ONE_TAG);
    }

    @Test
    void parserHTML() {
        String doubleTag = INPUT;
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");

        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(INPUT_NO_SLASH);
    }

    @Test
    void parserXML_toXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.xmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(DESIRED_ONE_TAG);
    }

    @Test
    void parserHTML_toXML() {
        String doubleTag = INPUT;
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Document doc = parser.parseInput(doubleTag, "");
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        assertThat(doc.selectFirst("img").outerHtml()).isNotBlank().isEqualTo(INPUT);
    }

    @Test
    void jsoupParseToXML() {
        String doubleTag = INPUT;

        final Document document = Jsoup.parse(doubleTag);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        String outputXhtml = document.html()
                .replaceAll("&nbsp;", "&#160;");// nbsp does not exist in xhtml.

        assertThat(outputXhtml).isNotBlank().isEqualTo(HEAD + INPUT + TRAIL);
    }

    @Test
    void jsoupParseToXML_outerMethod() {
        String doubleTag = INPUT;

        final Document document = Jsoup.parse(doubleTag);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        String outputXhtml = document.outerHtml()
                .replaceAll("&nbsp;", "&#160;");// nbsp does not exist in xhtml.

        assertThat(outputXhtml).isNotBlank().isEqualTo(HEAD + INPUT + TRAIL);
    }
}

alycecil, Feb 10 '22 20:02

My group and I will be working on a fix for this issue this semester.

QAQGaeBolg, Mar 11 '22 13:03

Hello @alycecil, our group is interested in this problem and would like to work on the issue. May we give it a try? (SE 2022 group haha, SUSTech)

MVP-D77, Apr 23 '22 09:04

It's hard to tell from this report under which conditions you are seeing duplicate attributes. I ran the test case and it does produce failures, but only because your expected values contain duplicate attributes, whereas both the desired and the actual output dedupe the attributes, in HTML and XML parses alike.

Closing as no-repro. If you have a simple example of code that outputs duplicate attributes, please reopen with that.

jhy, Oct 24 '23 07:10