protobuf-java-format icon indicating copy to clipboard operation
protobuf-java-format copied to clipboard

Error when parsing a valid XML file

Open whiver opened this issue 7 years ago • 9 comments

Hi, I am trying to parse a sample document into a Protobuf Message, using the AddressBook schema from Google examples:

Here is the document:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>[email protected]</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

Here is the code:

// All this initialization stuff is tested
InputStream inputData = XMLMapperTest.class.getResourceAsStream("/data/AddressBook_several.xml");
DynamicSchema schema = SchemaParser.parseSchema(XMLMapperTest.class.getResource("/schemas/AddressBook.desc").getPath(), false);
Descriptors.Descriptor descriptor = schema.getMessageDescriptor("AddressBook");

DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);

XmlFormat xmlFormat = new XmlFormat();
// Here is the instruction that raises the exception
xmlFormat.merge(inputData, StandardCharsets.UTF_8, builder);

Though, I get the following error:

com.googlecode.protobuf.format.ProtobufFormatter$ParseException: 5:21: Expected ">".

	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.parseException(XmlFormat.java:619)
	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.consume(XmlFormat.java:418)
	at com.googlecode.protobuf.format.XmlFormat.consumeClosingElement(XmlFormat.java:680)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:764)
	at com.googlecode.protobuf.format.XmlFormat.handleObject(XmlFormat.java:882)
	at com.googlecode.protobuf.format.XmlFormat.handleValue(XmlFormat.java:775)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:755)
	at com.googlecode.protobuf.format.XmlFormat.merge(XmlFormat.java:663)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:75)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:53)
	at com.googlecode.protobuf.format.ProtobufFormatter.merge(ProtobufFormatter.java:141)
[...]

I tried with UTF-8 and ISO-8859-1 encoding but I still get the error. Then I tried to remove the dots in the email address in my XML doc and I now parse successfully.

This is the working XML:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>johndoe@examplecom</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

If you want, I can also join the Protobuf schema if you want to try by yourself.

whiver avatar Dec 11 '17 14:12 whiver

In fact it seems that even that header tag : <?xml version="1.0" encoding="UTF-8"?> causes a parsing error. The same error occurs with UTF-8 characters, such as é, è or à. I am trying to figure it out but it's quite a pain to read the parser code :p

I also tried to use a String instead of an InputStream (I cannot find a signature corresponding to the example given in the Readme!):

String xml = IOUtils.toString(inputData, StandardCharsets.UTF_8);
xmlFormat.merge(xml, ExtensionRegistry.newInstance(), builder);

But I still get the same error. I printed my xml string and I shows the original one, so the problem comes right from the parsing process.

whiver avatar Dec 14 '17 15:12 whiver

I identified the error: in fact the nextToken() method from Tokenizer, and more precisely the TOKEN constant which matches the next token is far too restrictive as it allows only a very few values. For example, dots are not allowed, neither any non-ascii character.

The current Regex is:

extension|[a-zA-Z_\s;@][0-9a-zA-Z_\s;@+-]*+|[.]?[0-9+-][0-9a-zA-Z_.+-]*+|<\/|[\\0-9]++|"([^"
\\]|\\.)*+("|\\?$)|'([^'
\\]|\\.)*+('|\\?$)

whiver avatar Dec 14 '17 19:12 whiver

Would you mind putting up a fix with a test & update RELEASE-NOTES.md?

scr avatar Dec 14 '17 19:12 scr

Yep I'll try, but I need to find the right regex first, it might not be simple. When I find a solution for sure I'll create a pull request.

whiver avatar Dec 14 '17 20:12 whiver

Is this change integrated in the master branch ? - I had the same issue and I wonder if the modifications are working - the code simplifies a lot the tokenization regexps.

bouviervj avatar Jan 12 '18 19:01 bouviervj

No it's not merged yet, since simplifying the regex has side effects. In fact I think that the whole parser should be refactored, so I left it as is for the moment, if you find a way to make it work, feel free to fork my repo :)

whiver avatar Jan 12 '18 20:01 whiver

One thing I don't understand is why they had to code their own XML parser instead of using standard ones ?

bouviervj avatar Jan 12 '18 23:01 bouviervj

I don't know, that's why I gave up trying to debug it, since the whole parser is quite restrictive and does not handle every cases. I think the only clean solution would be to reimplement everything using an existing parser but I did not have time to do it yet.

whiver avatar Jan 13 '18 14:01 whiver

Hi , I have the same problem. Is this issue still open?

oboleka avatar May 14 '20 10:05 oboleka