Negative source range start/end positions for first text node

Open KennyWongPFPT opened this issue 1 year ago • 1 comments

Hello,

import org.jsoup.nodes.*;
import org.jsoup.parser.*;
import org.jsoup.select.*;

public class Test {
    public static void main(String[] args) {
        HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
        Parser parser = new Parser(treeBuilder);
        parser.setTrackPosition(true);
        Document document = parser.parseInput("foo<p></p>bar<p></p><div><b>baz</b></div>", "");
        NodeTraversor.traverse((Node node, int depth) -> {
            if (node instanceof TextNode textNode) {
                Range sourceRange = textNode.sourceRange();
                System.out.printf("text=%s start=%d end=%d%n",
                    textNode.text(),
                    sourceRange.start().pos(),
                    sourceRange.end().pos());
            }
        }, document);
    }
}

We are seeing negative start/end positions for the source range of the first text node, foo. For example, using release 1.16.1:

java -cp ~/.m2/repository/org/jsoup/jsoup/1.16.1/jsoup-1.16.1.jar Test.java
text=foo start=-1 end=-1
text=bar start=10 end=13
text=baz start=28 end=31

Release 1.17.2 reports the correct end position, but the start is still -1:

java -cp ~/.m2/repository/org/jsoup/jsoup/1.17.2/jsoup-1.17.2.jar Test.java
text=foo start=-1 end=3
text=bar start=10 end=13
text=baz start=28 end=31
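
For comparison, the offsets one would expect can be worked out without jsoup at all, by locating each text run in the input string. This is only a sketch for cross-checking the output above: the class name ExpectedOffsets and the helper span are made up for illustration, and it assumes positions are 0-based character offsets with an exclusive end, which is consistent with the bar and baz values reported above.

```java
public class ExpectedOffsets {
    // Returns {start, end} for the first occurrence of text in html.
    // start is a 0-based character offset; end is exclusive.
    static int[] span(String html, String text) {
        int start = html.indexOf(text);
        if (start < 0) {
            throw new IllegalArgumentException("not found: " + text);
        }
        return new int[] { start, start + text.length() };
    }

    public static void main(String[] args) {
        // Same input as the reproduction above.
        String html = "foo<p></p>bar<p></p><div><b>baz</b></div>";
        for (String text : new String[] { "foo", "bar", "baz" }) {
            int[] s = span(html, text);
            System.out.printf("text=%s start=%d end=%d%n", text, s[0], s[1]);
        }
    }
}
```

This prints start=0 end=3 for foo, start=10 end=13 for bar, and start=28 end=31 for baz, so under that assumption the only mismatch is the first text node.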

KennyWongPFPT, Jan 19 '24 17:01

@KennyWongPFPT Is it possible that in Parser.java, the following might be causing the issue?

public static Document parseBodyFragment(String bodyHtml, String baseUri) {
    Document doc = Document.createShell(baseUri);
    Element body = doc.body();
    List<Node> nodeList = parseFragment(bodyHtml, body, baseUri);
    Node[] nodes = nodeList.toArray(new Node[0]); // the node list gets modified when re-parented
    for (int i = nodes.length - 1; i > 0; i--) {
        nodes[i].remove();
    }
    for (Node node : nodes) {
        body.appendChild(node);
    }
    return doc;
}

I don't want to take a wild stab in the dark, but the HTML string you're passing doesn't begin with a tag, so the start position is potentially being set to -1. I'm wondering whether adding a check for that case would rectify the issue.

MasterChiefNemo, Feb 06 '24 10:02

Thanks, fixed!

jhy, Jul 01 '24 05:07