tinypie
tinypie copied to clipboard
Tree-Based Interpreter, compiler, and VM for TinyPie language
TinyPie - Tree-Based Interpreter, Compiler, and VM for TinyPie language
Overview
TinyPie is a tree-based interpreter for a simple programming language with a Python-like syntax.
It's based on Pie language from Language Implementation Patterns Ch.9
Quote from the book: "A tree-based interpreter is like a compiler front end with an interpreter grafted onto the end instead of a code generator"
TinyPie also includes Bytecode Assembler and Register-Based Virtual Machine.
Goals of the project
-
Self-education
-
To serve as an example for people interested in crafting their own interpreter in Python for a simple programming language or DSL
Installation
-
Using
buildout
$ git clone git://github.com/rspivak/tinypie.git $ cd tinypie $ python bootstrap.py $ bin/buildout Run the interpreter $ bin/tinypie factorial.tp
-
Using
pip
oreasy_install
(no need for sudo if usingvirtualenv
)$ sudo pip install tinypie (or run alternatively easy_install: $ sudo easy_install tinypie) Run the interpreter $ tinypie factorial.tp
Main interpreter features
-
Implemented in Python
-
Regexp-based lexer
-
LL(k) recursive-descent parser
-
Parser constructs homogeneous Abstract Syntax Tree (AST)
-
Static / lexical scope support. Interpreter builds complete scope tree during AST construction.
-
Interpeter manages global memory space and function space stack
-
Interpreter implements external AST visitor
-
Forward references support
High-level overview:
| +------------------------------------+
| source code | Symbol Table / Scope Tree |
| | |
\|/ | +-----------------------+ |
+--------------X---------------+ | | GlobalScope | |
| | | +----/-------------\----+ |
| Regexp-based lexer | | / \ |
| | |+--------/-----+ +-------\-------+|
+--------------+---------------+ ||FunctionSymbol| |FunctionSymbol ||
| |+------X-------+ +-------X-------+|
| token stream | /|\ /|\ |
\|/ | | | |
+--------------X---------------+ |+------+-------+ +-------+-------+|
| | || LocalScope | | LocalScope ||
|LL(k) recursive-descent parser| |+--------------+ +---------------+|
| | +------------------------------------+
+--------------+---------------+
| +------------------------------------+
| AST | Memory Space |
\|/ | |
+--------------X-------------+ |Function Space Global Memory |
| Interpreter | |Stack Space |
| with external tree visitor | |+-------------+ +----------------+|
| | ||FunctionSpace| | MemorySpace ||
+--------------+-------------+ |+-------------+ +----------------+|
| |+-------------+ |
| output ||FunctionSpace| |
\|/ |+-------------+ |
X +------------------------------------+
This Tree-Based Interpreter is similar to Syntax-Directed Interpreter but instead of directly executing source code during parsing the interpreter executes the source code by constructing AST and then walking the tree.
Advantages of tree-based interperter over syntax-directed one:
-
Symbol definition and symbol resolution happen at different stages:
- symbol definition is performed during parsing
- symbol resolution is performed during interpretation
This allows forward references and simplifies implementation - parser deals with scope tree only and interpreter deals with memory spaces.
-
More flexible because of AST as an intermediate representation.
Language grammar
program -> (function_definition | statement)+ EOF
function_definition -> 'def' ID '(' (ID (',' ID)*)? ')' slist
slist -> ':' NL statement+ '.' NL
| statement
statement -> 'print' expr NL
| 'return' expr NL
| call NL
| assign NL
| 'if' expr slist ('else' slist)?
| 'while' expr slist
| NL
assign -> ID '=' expr
expr -> add_expr (('<' | '==') add_expr)?
add_expr -> mult_expr (('+' | '-') mult_expr)*
mult_expr -> atom ('*' atom)*
atom -> ID | INT | STRING | call | '(' expr ')'
call -> ID '(' (expr (',' expr)*)? ')'
Short language reference
- statements:
print
,return
,if
,while
, and function call - literals: integers and strings
- operators: <, ==, +, -, *
Expressions may contain identifiers.
Code examples
-
Factorial
print factorial(6) # forward reference def factorial(x): # function definition if x < 2 return 1 # if statement return x * factorial(x - 1) # return statement . # DOT marks the block end
-
WHILE loop
index = 0 while index < 10: print index index = index + 1 . # DOT marks the block end
-
IF ELSE
x = 10 if x == 10 print 'Yes' else print 'No'
-
Variable lookup in different scopes
x = 1 # global variable def foo(x): # 'foo' is defined in global memory space print x # prints 3 (parameter value) . # DOT marks the block end def bar(): # 'bar' is defined in global memory space x = 7 # modify global variable . # DOT marks the block end foo(3) # prints 3 bar() print x # prints 7 ('bar' modified global variable)
AST Examples
Trees in these examples are represented as S-expressions for tersness.
Function call:
foo(5, 7)
(CALL ID INT INT) means CALL is the root with children ID(foo), INT(5), and INT(7)
Function definition:
def foo(x, y):
z = x + y
return z
.
(FUNC_DEF ID ID ID (BLOCK (ASSIGN ID (SUB ID ID)) (RETURN ID)))
Development
Install 'enscript' utility (optional). If you are on Ubuntu:
$ sudo apt-get install enscript
Boostrap the buildout and run it:
$ cd tinypie
$ python bootstrap.py
$ bin/buildout
Run tests, test coverage and produce coverage reports:
$ bin/test
$ bin/coverage-test
$ bin/coveragereport
Check ./var/report/tinypie.html out for coverage report.
Run pep8 and pylint to check code style and search for potential bugs:
$ bin/pep8
$ bin/pylint
AST visualizer
To see how TinyPie AST for different language constructs looks
like you can use gendot
command line utility that is generated
by buildout. This utility generates DOT file than can be further
processed by dot program to draw nice graphs.
$ echo -n 'foo(3)' | bin/gendot
digraph astgraph {
node [shape=plaintext, fontsize=12, fontname="Courier", height=.1];
ranksep=.3;
edge [arrowsize=.5]
node1 [label="BLOCK"];
node2 [label="CALL"];
node3 [label="ID (foo)"];
node4 [label="INT (3)"];
node2 -> node3
node2 -> node4
node1 -> node2
}
To draw graph and save it as a PNG image:
$ echo -n 'foo(3)' | bin/gendot > funcall.dot
$ dot -Tpng -o funcall.png funcall.dot
Bytecode Assembler
Converts assembly program into binary bytecodes. The bytecode is further interpreted by the TinyPie Register-Based Bytecode Interpreter / Virtual Machine.
TinyPie Assembly language grammar:
program -> globals? (label | function_definition | instruction | NL)+
globals -> '.globals' INT NL
label -> ID ':' NL
function_definition -> '.def' ID ':' 'args' '=' INT ',' 'locals' '=' INT NL
instruction -> ID NL
| ID operand NL
| ID operand ',' operand NL
| ID operand ',' operand NL ',' operand NL
| 'call' ID ',' operand NL
| 'loadk' REG ',' (INT | STRING) NL
operand -> REG | ID | STRING | INT
Assembler yields the following components:
-
Code memory: This is a
bytearray
containing bytecode instructions and their operands derived from the assembly source code. -
Global data memory size: The number of slots allocated in global memory for use with GSTORE and GLOAD assembly commands.
-
Program entry point: An address of main function
.def main: ...
-
Constant pool: A list of objects (integers, strings, function symbols) that are not part of the code memory. Bytecode instructions refer to those objects via integer index.
Here is a factorial function in TinyPie language:
def factorial(x):
if x < 2 return 1
return x * factorial(x - 1)
.
print factorial(5)
Here is an equivalent TinyPie assembly code:
.def factorial: args=1, locals=3
# r1 holds argument 'n'
loadk r2, 2
lt r3, r1, r2 # n < 2 ?
brf r3, cont # if n >= 2 jump to 'cont'
loadk r0, 1 # else return 1
ret
cont:
loadk r2, 1 # r2 = 1
move r3, r1 # r3 = n
sub r1, r1, r2 # r1 = n - 1
call factorial, r1 # factorial(n - 1)
mul r0, r3, r0 # n = n * result of factorial(n - 1)
ret
.def main: args=0, locals=1
loadk r1, 5
call factorial, r1 # factorial(5)
print r0 # 120
halt
Here are the resulting elements produced by the Bytecode Assembler by translating the above assembly code:
Constant pool:
0000: <FunctionSymbol: name='factorial', address=0, args=1, locals=3>
0001: 2
0002: 1
0003: <FunctionSymbol: name='main', address=95, args=0, locals=1>
0004: 3
Code memory:
0000: 6 0 0 0 2 0 0 0
0008: 1 4 0 0 0 3 0 0
0016: 0 1 0 0 0 2 13 0
0024: 0 0 3 0 0 0 41 6
0032: 0 0 0 0 0 0 0 2
0040: 9 6 0 0 0 2 0 0
0048: 0 2 14 0 0 0 3 0
0056: 0 0 1 2 0 0 0 1
0064: 0 0 0 1 0 0 0 2
0072: 16 0 0 0 0 0 0 0
0080: 1 3 0 0 0 0 0 0
0088: 0 3 0 0 0 0 9 6
0096: 0 0 0 1 0 0 0 4
0104: 16 0 0 0 0 0 0 0
0112: 1 15 0 0 0 0 10
Bytecode instructions are described in the following section.
Register-Based Bytecode Interpreter / Virtual Machine
Virtual Machine architecture:
-
IP: Instruction pointer register that points into the code memory at the next instruction to execute.
-
CPU: Instruction dispatcher that simulates fetch-decode-execute cycle with a switch (if elif) statement in a loop - reads bytecode at IP, decodes its operands and executes corresponding operation.
-
Global data memory: a list of global objects. Contents is accessed via integer index.
-
Code memory: Holds bytecode instructions and their operands.
-
Call stack: Holds StackFrame objects with function return address, parameters, and local variables.
-
Stack frame: A StackFrame object that holds all required information to invoke a function:
- function symbol
- function return address
- registers hold return value, arguments, locals, and temporary values
-
FP: Frame pointer - a special-purpose register that points to the top of the function
call stack
-
Constant pool: Integers, strings, and function symbols all go into constant pool. Instructions refer to constant pool values via an integer index.
High-level overview:
|
TinyPie | assembly
|
V
+-----------------------+
| |
| Bytecode Assembler |
| |
+----------+------------+
|
code memory(bytecode) | constant pool
V
+--------------------------------------------------------------------------+
| |
| Register-Based Bytecode Interpreter / Virtual Machine |
| |
| +------------------------+ +------------------------------------+ |
| | | | Function Call Stack | |
| | Constant Pool +----+ | | |
| | (integer, string, | | | | |
| | function symbol) | | | +--------------------------------+ | |
| +------------------------+ | | | Stack Frame | | |
| | | | | | |
| | | |function symbol: FS('main') | | |
| | | |return address: 0 | | |
| +------------------------+ | | | ret| args|locals | | |
| | | | | |registers: r0|r1 r2|r3 r4 r5 | | |
| | Global data memory | | | +--------------------------------+ | |
| | +-| | | | |
| | | | | | | |
| +------------------------+ | | | +--------------------------------+ | |
| | | | | Stack Frame | | |
| | | | | | | |
| | | | |function symbol: FS('fact') | | |
| +------------------------+ | | | |return address: 17 | | |
| | | | | | | ret| args |locals | | |
| | Code memory | | | | |registers: r0|r1 r2 r3|r4 | | |
| | (bytecode) | | | | +--------------------------------+ | |
| | | | | | | |
| +----------+-------------+ | | | | |
| | | | +------------------------------------+ |
| V | | |
+ +------------------------+ | | |
| | | | | |
| | CPU <-+ | |
| | (fetch-decode-execute) | | |
| | <----+ |
| +------------------------+ |
| |
+--------------------------------------------------------------------------+
Bytecode instructions for TinyPie VM:
# Index serves as an opcode
INSTRUCTIONS = [
None,
Instruction('add', REG, REG, REG), # A B C R(A) = R(B) + R(C)
Instruction('sub', REG, REG, REG), # A B C R(A) = R(B) - R(C)
Instruction('mul', REG, REG, REG), # A B C R(A) = R(B) * R(C)
Instruction('lt', REG, REG, REG), # A B C R(A) = R(B) < R(C)
Instruction('eq', REG, REG, REG), # A B C R(A) = R(B) == R(C)
Instruction('loadk', REG, POOL), # A B R(A) = CONST_POOL[B]
Instruction('gload', REG, POOL), # A B R(A) = GLOBALS[B]
Instruction('gstore', POOL, REG), # A B GLOBALS[A] = R(B)
Instruction('ret'),
Instruction('halt'),
Instruction('br', INT), # A branch to A
Instruction('brt', REG, INT), # A B R(A) is True -> branch to B
Instruction('brf', REG, INT), # A B R(A) is False -> branch to B
Instruction('move', REG, REG), # A B R(A) = R(B)
Instruction('print', REG), # A print R(A)
Instruction('call', FUNC, REG), # A B call A, R(B)
]
TinyPie VM comes with a tpvm
command line utility:
$ bin/tpvm -h
Usage: tpvm [input file]
If no input file is provided STDIN is used by default.
Options:
-h, --help show this help message and exit
-i FILE, --input=FILE
Input file. Defaults to standard input.
-c, --coredump Print coredump to standard output.
-d, --disasm Print disassembled code to standard output.
-t, --trace Print execution trace.
Example output:
fact.tps
.def fact: args=1, locals=3
# r1 holds argument 'n'
loadk r2, 2
lt r3, r1, r2 # n < 2 ?
brf r3, cont # if n >= 2 jump to 'cont'
loadk r0, 1 # else return 1
ret
cont:
loadk r2, 1 # r2 = 1
move r3, r1 # r3 = n
sub r1, r1, r2 # r1 = n - 1
call fact, r1 # fact(n - 1)
mul r0, r3, r0 # n = n * result of fact(n - 1)
ret
.def main: args=0, locals=1
loadk r1, 3
call fact, r1 # fact(3)
print r0 # 6
halt
$ bin/tpvm --coredump --disasm --trace -i /tmp/fact.tps
0095: LOADK r1, #4:3 main.registers=[? | ?] calls=[main]
0104: CALL #0:fact@0, r1 main.registers=[? | 3] calls=[main]
0000: LOADK r2, #1:2 fact.registers=[? | 3 | ? ? ?] calls=[main fact]
0009: LT r3, r1, r2 fact.registers=[? | 3 | 2 ? ?] calls=[main fact]
0022: BRF r3, 41 fact.registers=[? | 3 | 2 0 ?] calls=[main fact]
0041: LOADK r2, #2:1 fact.registers=[? | 3 | 2 0 ?] calls=[main fact]
0050: MOVE r3, r1 fact.registers=[? | 3 | 1 0 ?] calls=[main fact]
0059: SUB r1, r1, r2 fact.registers=[? | 3 | 1 3 ?] calls=[main fact]
0072: CALL #0:fact@0, r1 fact.registers=[? | 2 | 1 3 ?] calls=[main fact]
0000: LOADK r2, #1:2 fact.registers=[? | 2 | ? ? ?] calls=[main fact fact]
0009: LT r3, r1, r2 fact.registers=[? | 2 | 2 ? ?] calls=[main fact fact]
0022: BRF r3, 41 fact.registers=[? | 2 | 2 0 ?] calls=[main fact fact]
0041: LOADK r2, #2:1 fact.registers=[? | 2 | 2 0 ?] calls=[main fact fact]
0050: MOVE r3, r1 fact.registers=[? | 2 | 1 0 ?] calls=[main fact fact]
0059: SUB r1, r1, r2 fact.registers=[? | 2 | 1 2 ?] calls=[main fact fact]
0072: CALL #0:fact@0, r1 fact.registers=[? | 1 | 1 2 ?] calls=[main fact fact]
0000: LOADK r2, #1:2 fact.registers=[? | 1 | ? ? ?] calls=[main fact fact fact]
0009: LT r3, r1, r2 fact.registers=[? | 1 | 2 ? ?] calls=[main fact fact fact]
0022: BRF r3, 41 fact.registers=[? | 1 | 2 1 ?] calls=[main fact fact fact]
0031: LOADK r0, #2:1 fact.registers=[? | 1 | 2 1 ?] calls=[main fact fact fact]
0040: RET fact.registers=[1 | 1 | 2 1 ?] calls=[main fact fact fact]
0081: MUL r0, r3, r0 fact.registers=[1 | 1 | 1 2 ?] calls=[main fact fact]
0094: RET fact.registers=[2 | 1 | 1 2 ?] calls=[main fact fact]
0081: MUL r0, r3, r0 fact.registers=[2 | 2 | 1 3 ?] calls=[main fact]
0094: RET fact.registers=[6 | 2 | 1 3 ?] calls=[main fact]
0113: PRINT r0 main.registers=[6 | 3] calls=[main]
6
Constant pool:
0000: <FunctionSymbol: name='fact', address=0, args=1, locals=3>
0001: 2
0002: 1
0003: <FunctionSymbol: name='main', address=95, args=0, locals=1>
0004: 3
Code memory:
0000: 6 0 0 0 2 0 0 0
0008: 1 4 0 0 0 3 0 0
0016: 0 1 0 0 0 2 13 0
0024: 0 0 3 0 0 0 41 6
0032: 0 0 0 0 0 0 0 2
0040: 9 6 0 0 0 2 0 0
0048: 0 2 14 0 0 0 3 0
0056: 0 0 1 2 0 0 0 1
0064: 0 0 0 1 0 0 0 2
0072: 16 0 0 0 0 0 0 0
0080: 1 3 0 0 0 0 0 0
0088: 0 3 0 0 0 0 9 6
0096: 0 0 0 1 0 0 0 4
0104: 16 0 0 0 0 0 0 0
0112: 1 15 0 0 0 0 10
Disassembly:
0000: LOADK r2, #1:2
0009: LT r3, r1, r2
0022: BRF r3, 41
0031: LOADK r0, #2:1
0040: RET
0041: LOADK r2, #2:1
0050: MOVE r3, r1
0059: SUB r1, r1, r2
0072: CALL #0:fact@0, r1
0081: MUL r0, r3, r0
0094: RET
0095: LOADK r1, #4:3
0104: CALL #0:fact@0, r1
0113: PRINT r0
0118: HALT
Roadmap
- AST -> TAC (Three-Address Code)
- TAC -> Bytecode Assembly
- VM in C/C++ (here we go, the joy of manual memory management)