ppx_protocol_conv icon indicating copy to clipboard operation
ppx_protocol_conv copied to clipboard

How can I deserialize a `<xs:choice>` XML element into an OCaml variant type?

Open ahelwer opened this issue 1 year ago • 8 comments

This is a question, not an issue. I am trying to use this library to parse an XML format I do not have control over. Suppose it has an XSD as follows:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="modules">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="RootModule" type="xs:string"/>
        <xs:sequence maxOccurs="unbounded">
          <xs:choice>
            <xs:element ref="ModuleNode"/>
            <xs:element ref="ModuleNodeRef"/>
          </xs:choice>
        </xs:sequence>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Here is an example XML input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<modules>
  <RootModule>Test</RootModule>
  <ModuleNodeRef>
    <UID>304</UID>
  </ModuleNodeRef>
  <ModuleNode>
    <uniquename>Test</uniquename>
  </ModuleNode>
</modules>

I would like to deserialize this into something similar to the following, where the ModuleNodeRef and ModuleNode instances are a list of variant types:

open Protocol_conv_xml

type module_node = {
  uniquename : string
} [@@deriving protocol ~driver:(module Xml_light)]

type module_node_ref = {
  uid : int [@key "UID"]
} [@@deriving protocol ~driver:(module Xml_light)]

type module_node_or_ref =
  | ModuleNode of module_node [@key "ModuleNode"]
  | ModuleNodeRef of module_node_ref [@key "ModuleNodeRef"]
[@@deriving protocol ~driver:(module Xml_light)]

type modules = {
  root_module: string; [@key "RootModule"]
  modules : module_node_or_ref list;
} [@@deriving protocol ~driver:(module Xml_light)]

let () =
  let xml = Xml.parse_file "Test.xml" in
  match modules_of_xml_light xml with
  | Ok s -> s |> modules_to_xml_light |> Xml.to_string_fmt |> Printf.printf "t serialized: %s\n";
  | Error e -> print_endline (Xml_light.error_to_string_hum e)

However, when I run the program on the example input it just produces:

t serialized: <record>             
  <RootModule>Test</RootModule>
</record>

So the variants are ignored. How can I correctly parse the variants? Thanks!

ahelwer avatar Nov 07 '24 19:11 ahelwer

As a follow-on, if I start from a variant type and try to round-trip to XML I get an error; here is the code:

open Protocol_conv_xml

type foo = {
  name : string [@key "name"]
}
[@@deriving protocol ~driver:(module Xml_light)]

type bar = {
  value : int
}
[@@deriving protocol ~driver:(module Xml_light)]

type t =
  | A of foo [@key "A"]
  | B of bar [@key "B"]
[@@deriving protocol ~driver:(module Xml_light)]

let () = 
  let value = A {name = "test"} in
  let xml = to_xml_light value in
  Printf.printf "t serialized: %s\n" (Xml_light.to_string xml);
  ()

This produces an error when run:

Fatal error: exception Primitive value expected in node: variant for string. T: '<variant>
A
  <record>
    <name>test</name>
  </record>
</variant>'
Raised at Protocol_conv_xml__Xml_light.raise_errorf.(fun) in file "drivers/xml_light/xml_light.ml", line 30, characters 34-63
Called from Protocol_conv_xml__Xml_light.to_value in file "drivers/xml_light/xml_light.ml", line 193, characters 38-120
Called from Protocol_conv_xml__Xml_light.to_string in file "drivers/xml_light/xml_light.ml" (inlined), line 214, characters 16-46
Called from Dune__exe__Main in file "bin/main.ml", line 21, characters 37-62

ahelwer avatar Nov 07 '24 20:11 ahelwer

Thanks for the report.

ppx_protocol_conv_xml_light does not handle schemas, but only looks at various nodes in the xml.

I will need to test a bit furhter, but to parse the example given, I think you should create two sequences - one for ModuleNode and one for ModuleNodeRef:

type module_node = {
  uniquename : string
} [@@deriving protocol ~driver:(module Xml_light)]

type module_node_ref = {
  uid : int [@key "UID"]
} [@@deriving protocol ~driver:(module Xml_light)]

type modules = {
  root_module: string; [@key "RootModule"]
  module_nodes:  module_node list; [@key "ModuleNode"]
  module_node_refs: module_node_ref list; [@key "ModuleNodeRefs"]
} [@@deriving protocol ~driver:(module Xml_light)]

andersfugmann avatar Nov 08 '24 14:11 andersfugmann

As a follow-on, if I start from a variant type and try to round-trip to XML I get an error; here is the code:

open Protocol_conv_xml

type foo = {
  name : string [@key "name"]
}
[@@deriving protocol ~driver:(module Xml_light)]

type bar = {
  value : int
}
[@@deriving protocol ~driver:(module Xml_light)]

type t =
  | A of foo [@key "A"]
  | B of bar [@key "B"]
[@@deriving protocol ~driver:(module Xml_light)]

let () = 
  let value = A {name = "test"} in
  let xml = to_xml_light value in
  Printf.printf "t serialized: %s\n" (Xml_light.to_string xml);
  ()
This produces an error when run:

Fatal error: exception Primitive value expected in node: variant for string. T: ' A test ' Raised at Protocol_conv_xml__Xml_light.raise_errorf.(fun) in file "drivers/xml_light/xml_light.ml", line 30, characters 34-63 Called from Protocol_conv_xml__Xml_light.to_value in file "drivers/xml_light/xml_light.ml", line 193, characters 38-120 Called from Protocol_conv_xml__Xml_light.to_string in file "drivers/xml_light/xml_light.ml" (inlined), line 214, characters 16-46 Called from Dune__exe__Main in file "bin/main.ml", line 21, characters 37-62

Xml_light.to_string is an internal function to parse a node value as a string. You should use Xml.to_string_fmt as in your first example.

andersfugmann avatar Nov 08 '24 14:11 andersfugmann

I will need to test a bit furhter, but to parse the example given, I think you should create two sequences - one for ModuleNode and one for ModuleNodeRef

That is what I did for this case, but this was really a simplified example for a much more complicated example later on. This is basically the XML output of a parser and one of the nodes is an <entry> node that can be one of many possible node types. Here is the relevant section of the XSD:

https://github.com/tlaplus/tlaplus/blob/17e5d773b05e492311d3f3cc754e695b6f43b5ee/tlatools/org.lamport.tlatools/src/tla2sany/xml/sany.xsd#L28-L51

ahelwer avatar Nov 08 '24 14:11 ahelwer

TL;DR I don't think ppx_protocol_conv will be able to do what you want.

As I understand it, you want a record containing a algebraic datatype (<xs:choice>) and other fields to be "inlined": e.g.

type value = A of int
                   | B of string

type elem = {
  field: int;
  value: value;
}
  
type record = {
  data: elem list;
}

to serialize to:

<record>
   <field>1</field>
   <A>2</A>
   <field>2</field>
   <B>two</B>
   ...
</record>

the problem is that ppx_protocol_conv needs to have a name for each field. In this case the field value is not present, so it's not possible for the framework to construct a deserializer for the value field (or for the framework to know what value to deserialize for the field value.

andersfugmann avatar Nov 09 '24 22:11 andersfugmann

This seems like a fairly common pattern in XML that is certainly supported by XSD, so do you think it would be worth supporting with like an [@Inline] tag on the field or something? Wish that OCaml supported anonymous variants.

ahelwer avatar Nov 10 '24 14:11 ahelwer

It will be quite difficult to support with the current architecture. Consider deserialization: A sequence is defined as a list of elements of type t. For each element in the list, the de-serialization function is called and returns a type t. If the sequence is "inlined", then the bounds is less clear, and the de-serialization function will need to consume elements from a stream. This is a rather large change to the underlying infrastructure.

Mapping constructores to node names is less disruptive, and could potentially be done - but then some care must be taken to avoid name clashes in e.g.

type adt = A of int | B of string
type t = { adt: adt; 
                 a: int; [@key "A"]
} 

I think I'll reopen the issue, and see if I can create an alternative driver that can support (some of) these features.

andersfugmann avatar Nov 13 '24 12:11 andersfugmann

I would say the sequence of variants use case is less important than just inline variants generally, since sequences of variants can always be turned into fields of empty or nonempty sequences. Whereas if you have:

type adt = A of int | B of string
type t = {
  adt: adt; 
  uid: int;
} 

then I think this would only accept inputs like:

<t>
  <uid>5</uid>
  <adt>
    <A>6</A>
  </adt>
</t>

whereas it seems much more common in XML to omit the name of the variant datatype and just use the names of the specific variants inline, so either:

<t>
  <uid>5</uid>
  <A>6</A>
</t>

or:

<t>
  <uid>5</uid>
  <B>foo</B>
</t>

ahelwer avatar Nov 13 '24 15:11 ahelwer