feat(adapter): rewrite XMLAdapter for nested-data support
Closes #8481
## TL;DR

- Replaces regex parsing with `xml.etree.ElementTree`.
- Supports nested Pydantic models, repeated tags → `List`, and mixed data types.
- Keeps all existing flat-structure behaviour (no breaking changes expected).
## Motivation
XMLAdapter failed on any hierarchical XML (see #8481). Users were forced to switch to JSONAdapter, losing the readability benefits of XML. This PR brings feature-parity with JSONAdapter.
## What changed

### 1. Parsing & Formatting

| Area | Old | New |
|---|---|---|
| Parsing engine | Regex `<(\w+)>(.*?)</\1>` | `ElementTree` traversal |
| Formatting | JSON-in-XML | Canonical nested XML |
| Error handling | Bare exceptions | Explicit `AdapterParseError` with context |
### 2. New helpers

- `_xml_to_dict(element) → Any`
- `_dict_to_xml(data, tag) → str`

### 3. Removed

- `_parse_field_value()` – superseded by the full XML mapping.
## Backwards compatibility
- Flat structures behave exactly as before (all original tests still pass).
- No API signature changes; only internal behaviour differs.
## Example (was failing, now passes)

```python
class Address(BaseModel):
    street: str
    city: str

class Person(BaseModel):
    name: str
    age: int
    address: Address

class Sig(dspy.Signature):
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

xml_out = """
<person><name>John</name><age>30</age>
<address><street>Main</street><city>NYC</city></address>
</person>
"""
assert dspy.XMLAdapter(Sig()).parse(xml_out).person.name == "John"
```
## Tests added

- Tests covering the new behaviour have been added at `tests/adapters/test_xml_adapter.py`.
## Risks / limitations

- ElementTree does not preserve attribute order; irrelevant for our use case but worth noting.
- Doesn't yet support XML attributes (`<tag attr="…">`).
Happy to discuss any of this, and grateful for the chance to contribute to DSPy.
@chenmoneygithub : Thanks for your quick response!
**Regarding performance of XML vs JSON**
- In many of our internal tests, we have found that nesting XML and adhering to a fully XML output structure actually works better than JSON in many contexts, especially when adding a thinking/CoT process to an existing problem. This idea came from following Anthropic's prompting guide, and we were pleasantly surprised by the results.
- Even if we assume that JSON is strictly superior to XML, we probably shouldn't mix XML and JSON the way XMLAdapter does now; that only confuses the model. People who want JSON can already use JSONAdapter. If we offer XML parsing as an option, we might as well give it a proper implementation.
- We wanted to port to DSPy for a more structured approach in our GenAI application, and a recurring pattern in our prompts across use cases is that they are XML-based. That is why we wanted to implement these changes to unblock the rest of our teams.
**Regarding instructions to prompt the LM to generate nested XML**
- In our use case this is solved as soon as you include even one example input/output in few-shot prompting.
- I understand that this might not be everyone's use case; thank you for the feedback. Including a dynamic prompt based on the output field that recursively explains the nested structure should be straightforward. I'll make the necessary changes either tonight after work or by tomorrow EOD.
Thanks once again for your time in going through the PR and recommending changes.
@okhat : Could you please give a green light for me to work further on this PR? I believe that a proper XML handling adapter would be a great value add for many use cases, including ours.
Thank you for DSPy!
@Bhuvanesh09 This is pretty interesting, though complex. If you are interested in pursuing this path, here are some guidelines:
- Pick 1-2 datasets and 3-5 models, and report the benchmark scores for the current XMLAdapter and your proposed XMLAdapter.
- Share a github gist/colab/databricks notebook link to your benchmark script.
- A few screenshots or examples showing that the LM can output responses in XML format correctly.
We have seen mixing XML and JSON do all right, so to avoid causing a regression we need to collect numerical evidence that strict XML is beneficial.
This is very interesting, @Bhuvanesh09 . Thanks @chenmoneygithub for discussing it with @Bhuvanesh09 .
Q: How will this handle Lists?
Hi @okhat!
I'm currently working on collecting evidence to support my claims by curating a small custom dataset, aiming for minimal problems that focus on extracting structured information from natural language.
**How lists are handled**
Our XMLAdapter handles lists through repeated XML elements, which is the standard XML approach:
1. List Definition in Signature

```python
class TaskList(dspy.Signature):
    topic: str = dspy.InputField(desc="Topic to generate tasks for")
    tasks: list[str] = dspy.OutputField(desc="List of 5 specific tasks")
```
2. Expected XML Output Structure

```xml
<tasks>Task 1 description</tasks>
<tasks>Task 2 description</tasks>
<tasks>Task 3 description</tasks>
...
```
3. XML Attributes Don't Affect Parsing

Our parser also correctly handles XML attributes like `id="1"`, `id="2"` and ignores them during parsing:

```xml
<tasks id="1">Task 1 description</tasks>
<tasks id="2">Task 2 description</tasks>
<tasks id="3">Task 3 description</tasks>
```
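As a rough illustration of both points (using stdlib `ElementTree` directly, not the adapter's exact code; the `<output>` wrapper root is added here only so the fragment parses), repeated sibling tags collect into a list while attributes are simply never consulted:

```python
import xml.etree.ElementTree as ET

# Hypothetical model output: repeated <tasks> tags with optional id attributes,
# wrapped in a single <output> root so ElementTree can parse the fragment.
raw = """<output>
  <tasks id="1">Task 1 description</tasks>
  <tasks id="2">Task 2 description</tasks>
  <tasks id="3">Task 3 description</tasks>
</output>"""

root = ET.fromstring(raw)
# findall gathers every direct child with the same tag into a list;
# el.attrib is never read, so the id="..." attributes are ignored.
tasks = [el.text for el in root.findall("tasks")]
print(tasks)  # ['Task 1 description', 'Task 2 description', 'Task 3 description']
```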
Side note: earlier, before we used DSPy, making the LLM add attributes to the tags as above actually helped the model adhere to the number of outputs we asked for, since it is implicitly able to keep count. For instance, when we ask for 5 summary points, even weaker models become more consistent at producing exactly 5. It might be future scope to have the parser's prompt suggest that models do the same within DSPy.
Following the fruitful discussion with @chenmoneygithub, I've made changes to the code and included better instructions for output formatting. In my very early experiments, the results look good for this new parser, but I'm yet to compare it with the older one.
Hoping to wrap these experiments this weekend and share updates soon!
Hi @chenmoneygithub and @okhat,
Thank you for the discussion on this PR. I've conducted a series of experiments to provide data-driven evidence for the proposed changes, focusing on how different models interact with the adapters.
TL;DR: The current XMLAdapter uses a complex "JSON-in-XML" prompting strategy that is only consistently understood by larger, more capable models (e.g., Qwen 4B). My ImprovedXMLParser uses a simpler, direct prompt for pure nested XML, making it friendlier and more reliable for smaller models. This results in a 100% parsing rate for the improved adapter across all tested models, while the legacy adapter's parsing rate was as low as 0-10% for models under 4B parameters.
The full experiment notebook and dataset are available for complete reproducibility: <gist_link>
The Experiment: Testing Adapter Robustness
My experiment was designed to test how effectively each adapter could elicit correct, structured output from various language models.
- Dataset: The experiment uses the `person_dataset.csv` dataset.
- Task: Extract a person's name and their nested address (containing city and country) from a natural language sentence and format it into a consistent, nested XML structure.

The design rationale was to decouple the model's core NLU capabilities from its ability to adhere to a specific formatting schema. This lets us see whether a failure is due to the model not understanding the text or the adapter not providing clear instructions.
Experimental Results
The results clearly show that the ImprovedXMLParser is more reliable across a wider range of model sizes.
Parsing Accuracy (%)
The ImprovedXMLParser's direct prompting leads to a 100% parsing rate. The legacy adapter's more complex prompts are only consistently parsed when used with the 4B parameter model.
| Model | Legacy XMLAdapter | ImprovedXMLParser |
|---|---|---|
| Qwen 3: 0.6B | 10.00% | 100.00% |
| Llama 3.2: 3B | 0.00% | 100.00% |
| Qwen 3: 1.7B | 10.00% | 100.00% |
| Qwen 3: 4B | 100.00% | 100.00% |
Exact Accuracy (%)
The improved adapter's clarity also leads to higher final accuracy for the smaller models.
| Model | Legacy XMLAdapter | ImprovedXMLParser |
|---|---|---|
| Qwen 3: 0.6B | 10.00% | 85.00% |
| Llama 3.2: 3B | 0.00% | 90.00% |
| Qwen 3: 1.7B | 10.00% | 100.00% |
| Qwen 3: 4B | 100.00% | 95.00% |
The "Why": Prompt Complexity vs. Clarity
The difference in performance comes down to prompt complexity.
Current Legacy Adapter's behaviour:
The legacy adapter requires a high level of instruction-following capability by asking for a JSON object inside an XML tag. Only the strongest model tested (Qwen 4B) could handle this reliably.
```
<address>
{address} # note: the value you produce must adhere to the JSON schema: {"type": "object", ...}
</address>
```
Example failure: (image taken from the notebook linked in the gist)
The full prompt sent to the old adapter is also shown in the notebook.
New Adapter's behaviour:
The ImprovedXMLParser lowers the barrier to entry. It provides simple, direct instructions to generate pure, nested XML, a task that smaller models can easily accomplish.
An example `inspect_history` of even the weakest model (Qwen 0.6B) following it is included in the notebook.
Conclusion
The ImprovedXMLParser makes the XML feature more robust and accessible, especially for developers using smaller, more efficient models. By simplifying the instructions, it ensures reliable structured output without requiring a 4B+ parameter model. This change solves a key usability issue and makes the feature work as expected across a broader ecosystem of LLMs.

Note that none of these models were explicitly trained to emit nested XML structures. Even if they hadn't performed as well as they did, there is an argument that a consistent nested-XML structure would let people fine-tune focused models with the same ability. It then becomes a matter of choice whether to use nested XML or structured JSON. Providing a clear, single-format approach for either one may be more straightforward for both fine-tuning and zero-shot prompting, especially for smaller LMs.
I'm happy to discuss this further and make any additional changes.
Thanks a lot for taking the time to go through my PR.
Hi @chenmoneygithub and @okhat,
Hope you're having a good week.
I'm just checking in on this pull request to see if you've had a chance to review the experimental data. I wanted to make sure the proposed direction makes sense and see if there is anything else I can do to help move this forward.
No rush at all; just let me know if anything else is needed from my side so I can pick it back up.
Thanks!
@Bhuvanesh09 Thanks for putting up the result and reproducible code! This is a complex one, we will find some time to go through it carefully and loop back to you.
@chenmoneygithub : Great, thank you for the update! Please let me know if any questions come up during the review. Looking forward to your feedback.
Hi @chenmoneygithub, @okhat, just wanted to bring this PR back to your attention in case it's been buried. No pressure on the timeline at all, just wanted to make sure it hasn't fallen through the cracks. If you want any additional experiments to help you make the decision on its utility, I'd be happy to do them.
Yes, I'd love for this to be considered as there are plenty of great use cases I can think of for this. And nested XML is the norm (LLMs seem to like that a lot, too).
@isaacbmiller @chenmoneygithub
Quick ping on this PR, which has been cited in a few XMLAdapter issues. If you have a preferred direction for XMLAdapter, please share it. I’m happy to align and drive this to completion.
In parallel, in our internal fork we:
- added an `lxml` dependency to robustly handle ampersands in model output (escaping issues with bare `&` vs `&amp;`), avoiding ElementTree parse failures;
- made QoL improvements to list parsing and related helpers.

If `lxml` is acceptable, I can port these changes here. If not, we can proceed with your vision for this.
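For reference, if `lxml` is not acceptable as a dependency, one lightweight stdlib workaround for the same ampersand problem (this is a sketch, not our fork's implementation, which relies on lxml's recovery) is to escape bare `&` characters before handing the string to ElementTree:

```python
import re
import xml.etree.ElementTree as ET

def escape_bare_ampersands(text: str) -> str:
    """Escape '&' characters that do not already begin a valid XML entity
    (&amp;, &lt;, ..., or numeric character references), so that stdlib
    ElementTree can parse model output containing raw ampersands."""
    return re.sub(
        r"&(?!amp;|lt;|gt;|quot;|apos;|#\d+;|#x[0-9a-fA-F]+;)",
        "&amp;",
        text,
    )

# A bare '&' in model output would normally raise a ParseError.
raw = "<note>Tom & Jerry &amp; friends</note>"
root = ET.fromstring(escape_bare_ampersands(raw))
print(root.text)  # Tom & Jerry & friends
```

The negative lookahead leaves already-escaped entities untouched, so running it over well-formed output is a no-op.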
Could you share a rough review window so I can plan time on my side?
Thanks!