protobuf-java-format
protobuf-java-format copied to clipboard
Error when parsing a valid XML file
Hi, I am trying to parse a sample document into a Protobuf Message, using the AddressBook schema from Google examples:
Here is the document:
<AddressBook>
<people>
<name>John Doe</name>
<id>42</id>
<email>[email protected]</email>
</people>
<people>
<name>Jane Doe</name>
<id>41</id>
</people>
</AddressBook>
Here is the code:
// All this initialization stuff is tested
InputStream inputData = XMLMapperTest.class.getResourceAsStream("/data/AddressBook_several.xml");
DynamicSchema schema = SchemaParser.parseSchema(XMLMapperTest.class.getResource("/schemas/AddressBook.desc").getPath(), false);
Descriptors.Descriptor descriptor = schema.getMessageDescriptor("AddressBook");
DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);
XmlFormat xmlFormat = new XmlFormat();
// Here is the instruction that raises the exception
xmlFormat.merge(inputData, StandardCharsets.UTF_8, builder);
Though, I get the following error:
com.googlecode.protobuf.format.ProtobufFormatter$ParseException: 5:21: Expected ">".
at com.googlecode.protobuf.format.XmlFormat$Tokenizer.parseException(XmlFormat.java:619)
at com.googlecode.protobuf.format.XmlFormat$Tokenizer.consume(XmlFormat.java:418)
at com.googlecode.protobuf.format.XmlFormat.consumeClosingElement(XmlFormat.java:680)
at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:764)
at com.googlecode.protobuf.format.XmlFormat.handleObject(XmlFormat.java:882)
at com.googlecode.protobuf.format.XmlFormat.handleValue(XmlFormat.java:775)
at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:755)
at com.googlecode.protobuf.format.XmlFormat.merge(XmlFormat.java:663)
at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:75)
at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:53)
at com.googlecode.protobuf.format.ProtobufFormatter.merge(ProtobufFormatter.java:141)
[...]
I tried with UTF-8 and ISO-8859-1 encoding but I still get the error. Then I tried to remove the dots in the email address in my XML doc and I now parse successfully.
This is the working XML:
<AddressBook>
<people>
<name>John Doe</name>
<id>42</id>
<email>johndoe@examplecom</email>
</people>
<people>
<name>Jane Doe</name>
<id>41</id>
</people>
</AddressBook>
If you want, I can also join the Protobuf schema if you want to try by yourself.
In fact it seems that even that header tag : <?xml version="1.0" encoding="UTF-8"?> causes a parsing error. The same error occurs with UTF-8 characters, such as é, è or à.
I am trying to figure it out but it's quite a pain to read the parser code :p
I also tried to use a String instead of an InputStream (I cannot find a signature corresponding to the example given in the Readme!):
String xml = IOUtils.toString(inputData, StandardCharsets.UTF_8);
xmlFormat.merge(xml, ExtensionRegistry.newInstance(), builder);
But I still get the same error. I printed my xml string and I shows the original one, so the problem comes right from the parsing process.
I identified the error: in fact the nextToken() method from Tokenizer, and more precisely the TOKEN constant which matches the next token is far too restrictive as it allows only a very few values.
For example, dots are not allowed, neither any non-ascii character.
The current Regex is:
extension|[a-zA-Z_\s;@][0-9a-zA-Z_\s;@+-]*+|[.]?[0-9+-][0-9a-zA-Z_.+-]*+|<\/|[\\0-9]++|"([^"
\\]|\\.)*+("|\\?$)|'([^'
\\]|\\.)*+('|\\?$)
Would you mind putting up a fix with a test & update RELEASE-NOTES.md?
Yep I'll try, but I need to find the right regex first, it might not be simple. When I find a solution for sure I'll create a pull request.
Is this change integrated in the master branch ? - I had the same issue and I wonder if the modifications are working - the code simplifies a lot the tokenization regexps.
No it's not merged yet, since simplifying the regex has side effects. In fact I think that the whole parser should be refactored, so I left it as is for the moment, if you find a way to make it work, feel free to fork my repo :)
One thing I don't understand is why they had to code their own XML parser instead of using standard ones ?
I don't know, that's why I gave up trying to debug it, since the whole parser is quite restrictive and does not handle every cases. I think the only clean solution would be to reimplement everything using an existing parser but I did not have time to do it yet.
Hi , I have the same problem. Is this issue still open?