petitparser2

Parsing binary data

Open udoschneider opened this issue 3 years ago • 3 comments

I basically have a parser ready to parse text input. However, another representation of the source is binary, with exactly the same AST structure. It seems that adding these two methods allows me to treat ByteArrays as byte sequences and Integers as bytes:

ByteArray>>#asPParser
	^ PP2LiteralSequenceNode on: self

Integer>>#asPParser
	^ PP2LiteralObjectNode on: self

Thinking about it, doing something like

SequenceableCollection>>#asPParser
	^ PP2LiteralSequenceNode on: self

would even allow parsing "numeric collections" in general ...

Is this the way to go?

udoschneider, Jul 13 '21 08:07

Hi Udo,

I am not sure what your goal is. The mentioned asPParser methods allow you to do:

For ByteArray: 'foobar' asPParser parse: 'foobar'

I am not sure what exactly Integer>>asPParser does. Can it be used as follows? 'a' asInteger asPParser parse: 'a'

What kind of use case would you like to add?

kursjan, Jul 16 '21 17:07

Hi Kursjan,

The general idea is to be able to parse binary data (given as a ByteArray) where each element is an Integer (a byte). I can't disclose the protocol I work on (NDA), but I think WebAssembly is a good example.

E.g. the text format (p. 132) defines

For example, the textual grammar for value types is given as follows:

valtype ::= ‘i32’ ⇒ i32
| ‘i64’ ⇒ i64
| ‘f32’ ⇒ f32
| ‘f64’ ⇒ f64

E.g. the binary format (p. 114) defines

For example, the binary grammar for value types is given as follows:

valtype ::= 0x7F ⇒ i32
| 0x7E ⇒ i64
| 0x7D ⇒ f32
| 0x7C ⇒ f64

However, once the valtype token has been parsed, all the higher-level combination rules work exactly the same.

So the basic idea would be for PP to be able to parse binary literals by adding ByteArray>>#asPParser and Integer>>#asPParser. This would allow me to define WASMTextParser as a subclass of PP2CompositeNode with

valtype
    ^ ('i32' asPParser / 'i64' asPParser / 'f32' asPParser / 'f64' asPParser) ==> [:type | WASMValtypeNode type: type]

WASMTextParser would then define all the production rules on top of this valtype definition.

WASMBinaryParser (as a subclass of WASMTextParser) would then simply override valtype as

valtype
    ^ (16r7F asPParser / 16r7E asPParser / 16r7D asPParser / 16r7C asPParser) ==> [:type | WASMValtypeNode type: type]

However all the higher level production rules in the superclass would still work.
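
One detail worth spelling out: with the override above, the action block receives the matched byte (e.g. 16r7F) rather than the type name, so the two parsers would build slightly different nodes. A minimal sketch of a variant that maps each byte to the same name the text parser produces is shown here. It assumes the proposed Integer>>#asPParser extension is in place and that WASMValtypeNode (a hypothetical class from this discussion) accepts the type name as a String; it has not been run against PetitParser2.

```smalltalk
valtype
	"Hypothetical override in WASMBinaryParser.
	 Assumes the proposed Integer>>#asPParser extension exists.
	 Each alternative maps one byte to the type name used by the text parser.
	 The parentheses matter: binary selectors evaluate left to right."
	^ ((16r7F asPParser ==> [ :byte | 'i32' ])
		/ (16r7E asPParser ==> [ :byte | 'i64' ])
		/ (16r7D asPParser ==> [ :byte | 'f32' ])
		/ (16r7C asPParser ==> [ :byte | 'f64' ]))
		==> [ :type | WASMValtypeNode type: type ]
```

With this shape, the higher-level rules inherited from WASMTextParser would see the same valtype result in both parsers.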

So the only difference here would be how literals are parsed: as strings on the one hand (as usual), and as binary values on the other (what I proposed).

Does that help?

udoschneider, Jul 19 '21 09:07

Hi Udo,

Did I get it right that WASMTextParser should already work?

valtype
    ^ ('i32' asPParser / 'i64' asPParser / 'f32' asPParser / 'f64' asPParser) ==> [:type | WASMValtypeNode type: type]

String>>asPParser is already defined and would create a LiteralSequence parser.

Your proposal of extending ByteArray and Integer with asPParser sounds pretty much OK and aligned with the current PetitParser design. What would the extension look like? Something along these lines?

Integer>>asPParser
  ^ PP2LiteralObjectNode on: (Character from: self)

kursjan, Aug 01 '21 13:08