jsoup

Unicode breaks xml serialization

Open lambdaupb opened this issue 4 years ago • 2 comments

The input HTML is clearly malformed, but I would expect the output, after re-serializing it as XML, to be well-formed.

  • The serialized tag names contain non-ASCII Unicode characters, which is inconsistent with document.outputSettings().charset("ASCII");
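
To make the mismatch concrete, here is a minimal sketch that uses only the Java standard library (no jsoup involved) to check whether a name can be encoded in US-ASCII. The class `AsciiNameCheck` and the helper `isAsciiEncodable` are hypothetical names chosen for illustration; they are not part of jsoup.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class AsciiNameCheck {
    // Hypothetical helper: true only if every character of the name
    // can be represented in US-ASCII.
    public static boolean isAsciiEncodable(String name) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(name);
    }

    public static void main(String[] args) {
        // A plain "p" fits in ASCII.
        System.out.println(isAsciiEncodable("p"));
        // The first characters of the tag name from this report do not.
        System.out.println(isAsciiEncodable("p\u226F\u0322\u0329"));
    }
}
```

Text content that fails such a check can be written as a numeric character reference, but XML does not allow character references inside element names, so an ASCII-only serialization has no direct way to represent this tag name.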

Version: 1.13.1


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;

public class Test2 {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        Document document = Jsoup.parse("  <div id=\"emid\"> <p\u226F\u0322\u0329\u032B\u0320\u0309\u030A\u0366\u0364\u036D\u030A..\u0337\u0359\u036F\u030A\u033D\u0313\u0346\u0309\u036B.\u0347\u032A\u0367\u0305\u0301>\n    &lt; p=\"\"&gt; \n   </p\u226F\u0322\u0329\u032B\u0320\u0309\u030A\u0366\u0364\u036D\u030A..\u0337\u0359\u036F\u030A\u033D\u0313\u0346\u0309\u036B.\u0347\u032A\u0367\u0305\u0301>&lt;&gt; \n  </div> ");
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        document.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
        document.outputSettings().prettyPrint(true);
        document.outputSettings().charset("ASCII");
        String html = document.html();

        System.out.println(html);

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        saxParser.parse(new InputSource(new StringReader(html)), new DefaultHandler() {
            @Override
            public void warning(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }

            @Override
            public void error(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                e.printStackTrace();
            }
        });

    }
}

output:

<html>
 <head></head>
 <body>
  <div id="emid"> <p≯̢̩̫̠̉̊ͦͤͭ̊..̷͙ͯ̊̽̓͆̉ͫ.͇̪ͧ̅́>
     &lt; p=""&gt; 
   </p≯̢̩̫̠̉̊ͦͤͭ̊..̷͙ͯ̊̽̓͆̉ͫ.͇̪ͧ̅́>&lt;&gt; 
  </div> 
 </body>
</html>
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 21; Element type "p" must be followed by either attribute specifications, ">" or "/>".
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at Test2.main(Test2.java:31)
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 21; Element type "p" must be followed by either attribute specifications, ">" or "/>".
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
	at Test2.main(Test2.java:31)

Process finished with exit code 1
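
The reported position (line 4, column 21) appears to be the `≯` (U+226F) directly after `<p`. The rough sketch below, with ranges transcribed from the Name production in XML 1.0 (Fifth Edition), suggests why the parser stops there: U+226F falls outside the NameChar ranges, whereas the combining marks U+0300 to U+036F fall inside them. `XmlNameCheck` and its methods are hypothetical names for illustration, not part of jsoup or Xerces.

```java
public class XmlNameCheck {
    // NameStartChar ranges, transcribed from XML 1.0 (Fifth Edition).
    public static boolean isNameStartChar(int c) {
        return c == ':' || c == '_'
            || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
            || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
            || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
            || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
            || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
            || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // NameChar adds '-', '.', digits, U+00B7, combining marks and U+203F..U+2040.
    public static boolean isNameChar(int c) {
        return isNameStartChar(c) || c == '-' || c == '.'
            || (c >= '0' && c <= '9') || c == 0xB7
            || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }

    // Returns the char index of the first code point that is not allowed
    // at its position in an XML name, or -1 if the whole name is acceptable.
    public static int firstInvalidIndex(String name) {
        if (name.isEmpty()) {
            return 0; // an empty string is not a name
        }
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            boolean ok = (i == 0) ? isNameStartChar(cp) : isNameChar(cp);
            if (!ok) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Start of the tag name from this report: 'p', U+226F, combining marks.
        System.out.println(firstInvalidIndex("p\u226F\u0322\u0329"));
        // 'p' followed only by combining marks passes the check.
        System.out.println(firstInvalidIndex("p\u0322\u0329"));
    }
}
```

If this reading is right, the serialized name is rejected because of the character itself and not only because of the output charset; a serializer would have to drop or replace such characters to produce a well-formed element name.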

lambdaupb • Feb 24 '21 11:02

Hi, we are a student group and we would like to take a crack at this. We can't guarantee that we'll be able to complete it to a high enough standard, but we'd like to try.

LIKP0 • Mar 05 '21 03:03

Hello! I don't think there is an error with document.outputSettings().charset("ASCII"); If you paste "\u226F\u0322\u0329\u032B\u0320\u0309\u030A" into an online Unicode converter, you can see that it does translate to "≯̢̩̫̠̉̊". By the way, a Unicode character like "\u226F" has no corresponding ASCII character. You can try the code below, which proves the correctness of jsoup.

Document document = Jsoup.parse("\u0041\u0042\u0043"); //ABC
document.outputSettings().charset("ASCII");
String html = document.html();
System.out.println(html);

LIKP0 • Apr 16 '21 06:04