Form elements not updated when making DOM changes
I am parsing an html document and making some changes to it, then extracting some form fields. It looks like the elements in FormElement are already set when the document is initially parsed, and remain unchanged.
Example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;
public class ParseTest {
public static void main(final String... args) {
final String html = "<html><body><form><div id=d1><input id=foo></div><input id=bar></form></body></html>";
final Document doc = Jsoup.parse(html);
doc.select("#d1").remove();
final FormElement form = (FormElement) doc.selectFirst("form");
form.appendElement("select").attr("id", "baz");
System.out.println(form);
System.out.println();
System.out.println(form.elements());
}
}
The result shows that the form now has the bar and baz fields, but its elements are still foo and bar. Notice that foo was removed even before selecting the form element.
Workaround (re-parsing): form = (FormElement) Jsoup.parse(form.toString()).selectFirst("form");
Jsoup version: 1.17.2
Possible solution:
- make
elementsnon-final - when making any DOM change, traverse the ancestors of affected nodes and set any
FormElement'selementsto null (affects performance?) - update the
elements()method to reconstruct theelementsfield if null - update any other internal uses of
elementsto check for null or callelements()instead
OR...
...just add a method to reconstruct the elements of a form (will have to be called explicitly)
Yes, the form control elements in FormElement.elements() are specifically associated during the parse; they are not a runtime property.
I think if they are required, a reconstruction method would be the most appropriate. I would be happy to review a PR if you want to add it.
But, can you tell me what you are actually using those elements for? There may be a simpler solution, or a more ergonomic change we can make.
But, can you tell me what you are actually using those elements for? There may be a simpler solution, or a more ergonomic change we can make.
For making automated HTTP requests to a website (as in submitting a form). I'm parsing all the elements as they come and just setting the values I need. Some forms are modified dynamically in the browser and I need to mirror those modifications.
Thanks for confirming. That's definitely the main point of the FormElement, so we should make sure its use is ergonomic.
The main reason that we have the linking running at parse time is so that form element children that are moved out from the containing form tag during the parse are still associated with the form. But that doesn't mean that elements added dynamically ought not be associated as well.
Some rough thoughts:
Associating the new elements with the form automatically would be convenient, but may have a bit of an overhead and a low hit ratio. Linking it to the retrieval of the form would be ideal, but in your example it's just a plain cast so we don't have that opportunity. We do if the user is calling it via Elements#forms().
Maybe we can add a Element#asForm() method that runs a re-connect and simplifies the cast.
But in review of the code and use. I think it's best to reimplement FormElement#elements() to rebuild on demand and include the existing elements plus any others found as children.
Thanks, done. @aditsu it would be great if you could review and test (do a mvn install from head), and LMK.