
Forth Virtual Machine for Arduino

Open mikaelpatel opened this issue 8 years ago • 12 comments

I have started a new Forth kernel project targeted at the Arduino. This time it is a more traditional multi-tasking, token-threaded kernel with the focus on performance, foot-print and integration with Arduino core functions and libraries.

Written in C++ with approx. 110 kernel functions. Kernel foot-print is 3.5 Kbyte without dictionary strings. The dictionary strings add 750 bytes. A built-in trace adds approx. 1 Kbyte. See the Blink example sketch: https://github.com/mikaelpatel/Arduino-FVM/blob/master/examples/Blink/Blink.ino

Follow the development on https://github.com/mikaelpatel/Arduino-FVM.

Cheers!

mikaelpatel avatar Dec 23 '16 12:12 mikaelpatel

My VM is like yours; perhaps these things will help, perhaps not, since my target is i86.

In the switch-based dispatch code, `continue` works better than `goto next` (I got a speedup when I changed this).
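A minimal sketch of the dispatch style described above, assuming a byte-coded program and a top-of-stack value cached in a local variable. The opcode names and the `run` function are hypothetical and are not taken from FVM or from phreda4's VM; the point is only that each case ends in `continue`, which returns straight to the fetch at the top of the loop.

```c
#include <stdint.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_HALT, OP_LIT, OP_ADD, OP_DUP };

/* Run a byte-coded program until OP_HALT; returns the top of stack.
   No stack bounds checking, to keep the sketch short. */
int32_t run(const uint8_t *ip)
{
  int32_t stack[16];
  int32_t *sp = stack;      /* points at the next free slot */
  int32_t tos = 0;          /* top of stack cached in a local */

  for (;;) {
    switch (*ip++) {
    case OP_LIT:            /* push the following byte as a signed literal */
      *sp++ = tos;
      tos = (int8_t) *ip++;
      continue;             /* straight back to the fetch */
    case OP_DUP:
      *sp++ = tos;
      continue;
    case OP_ADD:
      tos += *--sp;
      continue;
    case OP_HALT:
    default:
      return tos;
    }
  }
}
```

Whether `continue` actually beats `goto next` depends on the compiler and target, so it is worth measuring on the processor in question.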

Some branchless code:

I do not have min and max in the core, but they can be defined as:

::min over - dup 31 >> and + ;
::max over swap - dup 31 >> and - ;

Here >> is a right shift that preserves the sign (arithmetic shift). For 16-bit cells change 31 to 15.

I have abs in the core words and this is the implementation:

case ABS: W=(TOS>>31);TOS=(TOS+W)^W;continue;
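The same three tricks can be sketched as plain C functions, assuming 32-bit cells. The function names are made up for this illustration. Two assumptions are worth stating loudly: `>>` on a negative signed value is implementation-defined in C (GCC and Clang sign-extend), and the subtraction must not overflow, so the extreme ends of the range are not handled.

```c
#include <stdint.h>

/* All-ones (-1) when x is negative, zero otherwise.
   Assumes >> on a negative int32_t is an arithmetic shift. */
static int32_t sign_mask(int32_t x) { return x >> 31; }

/* over - dup 31 >> and + */
int32_t min_branchless(int32_t a, int32_t b)
{
  int32_t d = b - a;              /* must not overflow */
  return a + (d & sign_mask(d));  /* adds (b - a) only when b < a */
}

/* over swap - dup 31 >> and - */
int32_t max_branchless(int32_t a, int32_t b)
{
  int32_t d = a - b;              /* must not overflow */
  return a - (d & sign_mask(d));  /* subtracts (a - b) only when a < b */
}

/* W=(TOS>>31); TOS=(TOS+W)^W; */
int32_t abs_branchless(int32_t x)
{
  int32_t w = sign_mask(x);
  return (x + w) ^ w;             /* for negative x: ~(x - 1) == -x */
}
```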

All the code is on my GitHub if you would like to see it.

Good luck

phreda4 avatar Dec 23 '16 21:12 phreda4

@phreda4 Thanks for the optimization suggestions. I have not come to this level of optimization yet. Do you have any benchmarks on branch-less threaded code? The stall due to conditional branches depends very much on the length of the processor's instruction pipeline and on any branch prediction support. The AVR has a very short pipeline (2 stages) while i86 can have 5-10.

Cheers!

PS: Would it be possible to use 0< in the min/max code above instead of 31 >>? I think there is also a GCC built-in function for fast sign-bit detection.
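One way to read this question, sketched in C under the assumption of 16-bit cells and a Forth-style 0< that returns a true flag of -1 (all bits set): the flag can serve directly as the mask, which removes the dependence on the cell width. The names `cell_t`, `zero_less`, `min_zero_less` and `max_zero_less` are hypothetical. Whether the comparison compiles to branch-free code depends on the compiler and target.

```c
#include <stdint.h>

typedef int16_t cell_t;   /* assumed 16-bit cell, as on an 8-bit AVR */

/* Forth 0< : true flag (-1, all bits set) when x is negative, else 0. */
static cell_t zero_less(cell_t x) { return (cell_t) -(x < 0); }

/* over - dup 0< and + */
cell_t min_zero_less(cell_t a, cell_t b)
{
  cell_t d = (cell_t) (b - a);               /* difference must fit in a cell */
  return (cell_t) (a + (d & zero_less(d)));
}

/* over swap - dup 0< and - */
cell_t max_zero_less(cell_t a, cell_t b)
{
  cell_t d = (cell_t) (a - b);               /* difference must fit in a cell */
  return (cell_t) (a - (d & zero_less(d)));
}
```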

mikaelpatel avatar Dec 24 '16 08:12 mikaelpatel

For even more ideas, you might have a look at https://github.com/MitchBradley/cforth

As with your FVM, it's switch-threaded. I first wrote it about 35 years ago and have been refining it ever since. It runs on 16, 32, and 64 bit architectures - although I haven't done anything with 16-bit systems for years. You can buy 32-bit ARM chips these days for about US$1, so I have a hard time justifying the use of a 16-bit processor. For less than $10 you can get an ESP32 module with two fast Xtensa cores, WiFi and Bluetooth, 16MB of FLASH, and 512KB of RAM.

When I first wrote it in the 80s, C compiler code generation was not particularly good. I used every trick that I knew to make it fast. It was within a factor of 2 of the speed of my direct-threaded native Forth for 68K. Lately I have been much less concerned with speed; it's fast enough for everything I have wanted to do, and it's easy to add special-purpose primitives in C if you need the last ounce of speed for things like bit-banging high-rate serial protocols.

It's very complete, with a full suite of programmer-productivity tools like a decompiler with syntax highlighting, visual debugger, and command completion. It can serve as the base for a full-on Open Firmware implementation, it can run on 64-bit systems with OpenGL graphics access, yet it can run in a stripped-down form, albeit with a live text interpreter, on the processor core of a Bluetooth chip whose only memory is 20K of free RAM. There is a tethered mode where you can run the full "all the tools" Forth environment on a host (PC or Mac), connecting to an embedded processor that has a 700-byte communications stub through which the host Forth can access the memory of the target system and script C subroutines from the target app.

It's also quite portable, having been used on PCs from the 16 through 64 bit generations, Macs from 68K to PowerPC to x86 32 and 64 bit, many different RISC servers, and many different embedded chips including ARM (original and Thumb instruction sets), Xtensa, and others. At one point I had it running on a cell phone that only ran Java apps. (The Java code is still in the tree but is stale; I haven't maintained its I/O interfaces.) I used a C preprocessor script to translate the kernel C code into a form that the Java compiler would accept. That is why some of the C constructs in the kernel are a little funny - I converted all the C pointer indirection operators to array accesses (Java doesn't like pointers). Back in the old days, that would have caused a slowdown, but modern C compilers are so good that it makes little difference now.
    

MitchBradley avatar Dec 24 '16 14:12 MitchBradley

mikael: I haven't tested the speed; if you run some tests it would be good to see the results!

I don't know AVR assembly, which is why I said "perhaps".

The 0< code can work if it gives -1 for a negative number and 0 otherwise. The i86 has CDQ for sign extension, but it needs TOS in EAX and EDX to be free; perhaps AVR assembly has a similar instruction. My conditionals and control structures are different from normal Forth, so I do not have 0<.

phreda4 avatar Dec 24 '16 15:12 phreda4

To my knowledge, the AVR ISA is a typical RISC one with a short pipeline and no branch target prediction, for which case statements can compile with results ranging from quite sufficient to really horrible. I suggest compiling the threading test from Anton Ertl as a comparison:

http://www.complang.tuwien.ac.at/forth/threaded-code.html

It's also a good idea to take a look at the generated assembler listings and change optimization flags to study differences of the generated machine code.

Mat2 avatar Dec 24 '16 21:12 Mat2

Under some conditions (consecutive case numbers and full coverage of the options) switch/case statements compile to an "on..goto" construction, i.e. a calculated jump.
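To make the "calculated jump" concrete, here is a hand-written equivalent in C: a table indexed by opcode, which is roughly the shape a compiler may produce for a dense switch. This is a sketch only; the opcodes, handlers and `execute` function are invented for the example, and real compiler output should be checked in the assembler listing.

```c
#include <stdint.h>

/* Hypothetical opcode set with dense, consecutive values 0..2. */
enum { OP_INC, OP_DEC, OP_NEG, OP_COUNT };

static int32_t op_inc(int32_t tos) { return tos + 1; }
static int32_t op_dec(int32_t tos) { return tos - 1; }
static int32_t op_neg(int32_t tos) { return -tos; }

/* One entry per opcode; the index is the opcode itself. */
static int32_t (*const handlers[OP_COUNT])(int32_t) = { op_inc, op_dec, op_neg };

/* Apply n opcodes to an initial value; an out-of-range opcode stops execution. */
int32_t execute(const uint8_t *code, int n, int32_t tos)
{
  for (int i = 0; i < n; i++) {
    if (code[i] >= OP_COUNT)
      break;                        /* range check, needed when coverage is not full */
    tos = handlers[code[i]](tos);   /* calculated jump: one indexed indirect call */
  }
  return tos;
}
```

With full coverage of the opcode range the range check becomes redundant, which is one reason the "full cover" condition matters.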

phreda4 avatar Dec 24 '16 22:12 phreda4

Mitch,

This summer I acquired a Mac Classic 2 - circa 1990.

I have half a plan to turn it into a Forth workstation - the ultimate retro machine.

The hard disk is missing but it boots from floppy - and I might just add a 400MHz STM32F746 - to act as coprocessor.

Interested in OpenGL, and any open source tools for engineering - such as CAD, pcb layout, transistor/gate array/FPGA logic design.

There really has to be a better route than the one we are currently following.....

regards

Ken


monsonite avatar Dec 26 '16 01:12 monsonite

There are several advantages in making Forth/Threaded Interpreters available within the Arduino community. With the limited amount of resources, limited support for debugging and many newcomers this gives a great environment for teaching and exposing students and hobbyists to Forth and similar languages.

Obviously integration with the Arduino core and libraries is essential.

One of the interesting challenges is the Harvard architecture of the AVR processor. Most Forth kernels are written for a von Neumann architecture.

My first attempt was the Arduino-Shell project, which abstracts the separate AVR memories into a common address space: https://github.com/mikaelpatel/Arduino-Shell
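As an illustration of what a common address space over separate memories can look like, here is a portable C sketch. It is not how Arduino-Shell does it; the split point `RAM_BASE`, the two arrays and the `fetch_byte`/`store_byte` names are all assumptions made for the example. On a real AVR the program-memory read would go through `pgm_read_byte()` from `<avr/pgmspace.h>`.

```c
#include <stdint.h>

/* Assumed split point: addresses below RAM_BASE refer to program memory. */
#define RAM_BASE 0x8000u

static const uint8_t prog_mem[] = { 0x11, 0x22, 0x33 };  /* stands in for flash */
static uint8_t data_mem[64];                             /* stands in for SRAM */

/* Read one byte from the unified address space; unmapped addresses read as 0. */
uint8_t fetch_byte(uint16_t addr)
{
  if (addr < RAM_BASE) {
    /* On AVR hardware this branch would use pgm_read_byte(). */
    return addr < sizeof prog_mem ? prog_mem[addr] : 0;
  }
  addr -= RAM_BASE;
  return addr < sizeof data_mem ? data_mem[addr] : 0;
}

/* Write one byte; stores to program memory or unmapped addresses are ignored. */
void store_byte(uint16_t addr, uint8_t value)
{
  if (addr >= RAM_BASE && (uint16_t) (addr - RAM_BASE) < sizeof data_mem)
    data_mem[addr - RAM_BASE] = value;
}
```

The cost of this design is an address test on every memory access, which is the price of letting Forth code treat both memories as one.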

This project started as a port of VFM; https://github.com/mikaelpatel/vfm, which has really great performance on Linux/x86 with only some register hinting to gcc.

I think it is important to make Forth available as an embedded, interactive hardware and firmware debugging tool, and especially to focus on integration with other languages, i.e., to move away from the single-application-language paradigm.

Cheers!

mikaelpatel avatar Dec 26 '16 12:12 mikaelpatel

Some progress over the holidays. The kernel is now 120 primitives (3.5 Kbyte without dictionary strings). Many kernel operations are defined both in C/C++ and threaded code to allow tailoring for foot-print and/or speed. An initial token compiler is also available. To make things simple :) it is written for the Arduino. Please see https://github.com/mikaelpatel/Arduino-FVM for some screen-shots and further details.

mikaelpatel avatar Jan 03 '17 12:01 mikaelpatel

Further progress and the first release of this project is soon in sight. The kernel is now 128 primitives (3.7 Kbyte without dictionary table and strings). The token compiler is completed.

The examples range from a simple shell to benchmarks. The multi-tasking benchmarks show that a context switch to and from the virtual machine is approx. 9 us (Arduino Uno, ATmega328 @ 16 MHz). This includes two threaded instructions (yield, branch). A pure context switch is approx. 6 us (halt).

mikaelpatel avatar Jan 06 '17 19:01 mikaelpatel

The Forth Virtual Machine (FVM) has been updated to allow threaded code in data and program memory. A Forth style outer interpreter with compiler has been added; https://github.com/mikaelpatel/Arduino-FVM/blob/master/examples/Forth/Forth.ino. Typically on an Arduino Uno (2Kbyte SRAM) this will allow for adding approx. 32 new definitions in a 1 Kbyte data area. The Arduino Mega (8Kbyte SRAM) gives much more to play with (approx. 128 definitions, 7 Kbyte data area).

An interesting optimization in this byte token threaded virtual machine is built-in tail call reduction in the inner interpreter, i.e. return addresses are only pushed on the return stack when the next instruction is not exit.
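The tail call reduction described above can be sketched in a few lines of C. This is an illustration of the idea, not the FVM inner interpreter: the opcodes, the one-byte call operand and the `result_t` bookkeeping are assumptions made for the example. The call handler peeks at the instruction following the call and skips the return stack push when it is an exit.

```c
#include <stdint.h>

/* Hypothetical opcodes; OP_CALL is followed by a one-byte target offset. */
enum { OP_HALT, OP_EXIT, OP_CALL, OP_INC };

typedef struct {
  int32_t tos;       /* value computed by the program */
  int max_depth;     /* deepest return stack use observed */
} result_t;

/* Run code starting at offset start. No return stack bounds checking. */
result_t run_tail(const uint8_t *code, uint8_t start)
{
  const uint8_t *rstack[8];
  int rp = 0;
  result_t r = { 0, 0 };
  const uint8_t *ip = code + start;

  for (;;) {
    switch (*ip++) {
    case OP_INC:
      r.tos++;
      continue;
    case OP_CALL: {
      const uint8_t *target = code + *ip++;
      if (*ip != OP_EXIT) {        /* ordinary call: remember where to resume */
        rstack[rp++] = ip;
        if (rp > r.max_depth)
          r.max_depth = rp;
      }                            /* tail call: no push, the callee's exit
                                      returns directly to our caller */
      ip = target;
      continue;
    }
    case OP_EXIT:
      if (rp == 0)
        return r;                  /* exit from the outermost definition */
      ip = rstack[--rp];
      continue;
    case OP_HALT:
    default:
      return r;
    }
  }
}
```

For a definition whose last action is a call, the return stack stays one level shallower than it otherwise would, which matters when the return stack lives in 2 Kbyte of SRAM.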

mikaelpatel avatar Jan 21 '17 16:01 mikaelpatel

The latest version (1.1) adds support for Arduino Due (SAM); 32-bit cell size, RAM based (no mixed memory management). Please see the example sketch Forth.ino; https://github.com/mikaelpatel/Arduino-FVM/blob/master/examples/Forth/Forth.ino

mikaelpatel avatar Mar 05 '17 22:03 mikaelpatel