jsoup icon indicating copy to clipboard operation
jsoup copied to clipboard

Form elements not updated when making DOM changes

Open aditsu opened this issue 1 year ago • 1 comments

I am parsing an html document and making some changes to it, then extracting some form fields. It looks like the elements in FormElement are already set when the document is initially parsed, and remain unchanged. Example code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

public class ParseTest {
	public static void main(final String... args) {
		final String html = "<html><body><form><div id=d1><input id=foo></div><input id=bar></form></body></html>";
		final Document doc = Jsoup.parse(html);
		doc.select("#d1").remove();
		final FormElement form = (FormElement) doc.selectFirst("form");
		form.appendElement("select").attr("id", "baz");
		System.out.println(form);
		System.out.println();
		System.out.println(form.elements());
	}
}

The result shows that the form now has the bar and baz fields, but its elements are still foo and bar. Notice that foo was removed even before selecting the form element.

Workaround (re-parsing): form = (FormElement) Jsoup.parse(form.toString()).selectFirst("form"); Jsoup version: 1.17.2

aditsu avatar May 30 '24 06:05 aditsu

Possible solution:

  • make elements non-final
  • when making any DOM change, traverse the ancestors of affected nodes and set any FormElement's elements to null (affects performance?)
  • update the elements() method to reconstruct the elements field if null
  • update any other internal uses of elements to check for null or call elements() instead

OR...

...just add a method to reconstruct the elements of a form (will have to be called explicitly)

aditsu avatar May 30 '24 07:05 aditsu

Yes, the form control elements in FormElement.elements() are specifically associated during the parse; they are not a runtime property.

I think if they are required, a reconstruction method would be the most appropriate. I would be happy to review a PR if you want to add it.

But, can you tell me what you are actually using those elements for? There may be a simpler solution, or a more ergonomic change we can make.

jhy avatar Jul 01 '24 06:07 jhy

But, can you tell me what you are actually using those elements for? There may be a simpler solution, or a more ergonomic change we can make.

For making automated HTTP requests to a website (as in submitting a form). I'm parsing all the elements as they come and just setting the values I need. Some forms are modified dynamically in the browser and I need to mirror those modifications.

aditsu avatar Jul 08 '24 07:07 aditsu

Thanks for confirming. That's definitely the main point of the FormElement, so we should make sure its use is ergonomic.

The main reason that we have the linking running at parse time is so that form element children that are moved out from the containing form tag during the parse are still associated with the form. But that doesn't mean that elements added dynamically ought not be associated as well.

Some rough thoughts:

Associating the new elements with the form automatically would be convenient, but may have a bit of an overhead and a low hit ratio. Linking it to the retrieval of the form would be ideal, but in your example it's just a plain cast so we don't have that opportunity. We do if the user is calling it via Elements#forms().

Maybe we can add a Element#asForm() method that runs a re-connect and simplifies the cast.

But in review of the code and use. I think it's best to reimplement FormElement#elements() to rebuild on demand and include the existing elements plus any others found as children.

jhy avatar Jul 15 '24 06:07 jhy

Thanks, done. @aditsu it would be great if you could review and test (do a mvn install from head), and LMK.

jhy avatar Jul 15 '24 07:07 jhy