parsing-techniques
                                
                                 parsing-techniques copied to clipboard
                                
                                    parsing-techniques copied to clipboard
                            
                            
                            
                        Lecture on Parsing Techniques
Parsing Techniques
There are lots of formal languages for various kinds of practical purposes. But they all have one thing in common: for further processing them inside a program, they first have to be parsed from their character string representation. This is the material of a lecture about various techniques to perform this parsing step.
Notice: the code is all written in ECMAScript 6, is on-the-fly transpiled to ECMAScript 5 and then executed under Node.js, but it actually doesn't matter very much. Equivalent code can be written in Java or C#, too. The only major point is just that the required third-party libraries have to be also changed, of course.
Parsing Input
Let's imagine a formal language for describing key/value based configurations in a redundancy-free nested structure. A sample configuration can be:
foo {
    baz = 7 // some comment
    bar {
        quux = 42
        hello = "{hello} = \"world\"!"
    }
    quux = 3
}
bar  = 1
quux = 2
This is a very simple formal language, but it already has some cruxes which can become a hurdle for parsing:
- nested sections
- intermixed comments
- alternatives (value is either number or string)
- string value can contain spaces, quotes and section braces
Parsing Output
Let's imagine we want to parse configurations in the above format into a simple key/value format where the sections are flattened:
foo.bar.quux 42
foo.bar.hello {hello} = "world"!
foo.baz 7
foo.quux 3
bar 1
quux 2
Parsing Techniques
There are various parsing techniques available, each with their pros and
cons. For illustration purposes we've implemented a bunch of them. Each
one can be run by executing make <id> where <id> is one of 0-re,
1-sm, 2-sm-ast, 3-ls-rdp-ast or 4-peg-ast. Follow the above
links to their particular source code and documentation.
- 
cfg2kv-0-re/:
 Regular Expressions (RE)
- 
cfg2kv-1-sm/:
 State Machine (SM)
- 
cfg2kv-2-sm-ast/:
 State Machine (SM), Abstract Syntax Tree (AST)
- 
cfg2kv-3-ls-rdp-ast/:
 Lexical Scanner (LS), Recursive Descent Parser (RDP), Abstract Syntax Tree (AST)
- 
cfg2kv-4-pc-ast/:
 Parser Combinators (PC), Abstract Syntax Tree (AST)
- 
cfg2kv-5-peg-ast/:
 Parsing Expression Grammar (PEG) Parser, Abstract Syntax Tree (AST)