Discussion: Designing and implementing multitasking
Neither ANS Forth nor Gforth comes with a standardized wordset for multitasking, let alone a reference implementation of some sort.
The consensus on 6502.org seems to be that multitasking can work well with round-robin cooperative designs, where words like emit (interactive ones especially) give up some time through a word like pause. The other option would seem to be IRQs, see Garth's Simple methods for multitasking without a multitasking OS (http://wilsonminesco.com/multitask/index.html and also http://forum.6502.org/viewtopic.php?f=2&t=2281). That would be harder to simulate, given that py65 for instance has no way of including interrupts, let alone timed interrupts.
One way or the other, we'll have to figure out how many tasks we can accept given the size of the stack and the space left for user variables. I'm wondering if four is enough?
More links, starting (of course) with Brad Rodriguez' Forth Multitasking in a Nutshell (http://www.bradrodriguez.com/papers/mtasking.html). Then we have a higher-level suggestion A multi-tasking wordset for Standard Forth by Andrew Haley from EuroForth (http://www.complang.tuwien.ac.at/anton/euroforth/ef17/papers/haley-slides.pdf). There is a bunch of Forth e.V. stuff at https://wiki.forth-ev.de/doku.php/projects:4dinhalt in German I haven't looked at yet, especially https://wiki.forth-ev.de/lib/exe/fetch.php/vd-archiv:4d2014-03.pdf. There is this description of a functioning system: http://www.mosaic-industries.com/embedded-systems/sbc-single-board-computers/freescale-hcs12-9s12-c-language/instrument-control/forth-real-time-operating-system for inspiration.
I think that a round-robin cooperative setup makes the most sense for Tali as there is no standardized clock. The simulator doesn't have one (although it's easy to extend so that it does), and who knows what users will have for hardware. The cooperative system will work on the widest range of hardware and avoids interrupts, which can be problematic for users who are new to microprocessors.
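To make the round-robin cooperative idea concrete, here is a minimal sketch in Python (not Forth, and not Tali code). It assumes each task is a generator and that `yield` stands in for PAUSE; the names `run_round_robin` and `counter` are made up for illustration. No clock or interrupt is involved: a task keeps the CPU until it gives it up voluntarily.

```python
from collections import deque


def run_round_robin(tasks):
    """Run generator-based tasks in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until its next yield ("PAUSE")
        except StopIteration:
            continue            # task finished, so it leaves the rotation
        ready.append(task)      # still alive: go to the back of the queue


def counter(name, steps, log):
    """Hypothetical task: record one step, then give up the CPU."""
    for i in range(steps):
        log.append((name, i))
        yield                   # stand-in for PAUSE


log = []
run_round_robin([counter("a", 2, log), counter("b", 3, log)])
print(log)
```

The two tasks interleave one step at a time, and the longer one simply keeps running after the shorter one has finished. A task that never yields would starve all the others, which is the usual price of a cooperative design.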
In my mind, there are two types of tasks.
If a task can guarantee that it will only PAUSE with no items on the data stack, then it can actually share a data stack. I'll call this a "light" task. Tasks that need to keep stuff on the stack while calling PAUSE (or tasks that are going to do serial I/O if PAUSE is added to the serial I/O words as is recommended in some of the sources you've cited) are "full" tasks and they will need their own data stack.
If you break the data stack into three pieces, with the first piece having half of the current ZP stack memory (I'll call this A) and the other pieces each getting a quarter (I'll call these B and C), then A can be used for the "main" task as well as quite a few "light" tasks sharing the same stack. This allows for the main task to still have a reasonably deep stack if the "light" tasks aren't too stack hungry. Because the "light" tasks promise to leave nothing on the stack when they PAUSE, the total stack needed for area A is only the main stack usage plus the stack usage from only the most greedy "light" task. The areas B and C can be used for "full" tasks that run separately from the main task. This setup allows 2 "full" tasks in addition to the main task, and as many "light" tasks as you want to code support for.
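The budget for area A can be checked with a little arithmetic. This is only a sketch in Python: the 128-byte total and the stack depths are invented numbers, not Tali's actual layout, and the function names are hypothetical. The one thing it takes from the scheme above is that the "light" tasks never overlap on the stack, so only the deepest one counts.

```python
CELL_SIZE = 2  # bytes per stack cell, assuming 16-bit cells


def split_stack(total_bytes):
    """Split the data stack into A (half) plus B and C (a quarter each)."""
    a = total_bytes // 2
    b = total_bytes // 4
    c = total_bytes - a - b     # C takes whatever is left over
    return a, b, c


def area_a_worst_case(main_depth, light_depths):
    """Worst-case depth of area A in cells: main task plus greediest light task."""
    return main_depth + max(light_depths, default=0)


def fits_in_area_a(total_bytes, main_depth, light_depths):
    """True if the main task and any single light task fit into area A."""
    a, _, _ = split_stack(total_bytes)
    return area_a_worst_case(main_depth, light_depths) * CELL_SIZE <= a


# Hypothetical numbers: 128 bytes of stack, main task 20 cells deep,
# three light tasks needing 4, 8 and 6 cells.
print(split_stack(128))
print(area_a_worst_case(20, [4, 8, 6]))
print(fits_in_area_a(128, 20, [4, 8, 6]))
```

With these made-up numbers area A is 64 bytes and the worst case is 28 cells (56 bytes), so it fits. Adding more light tasks changes nothing unless one of them is deeper than 8 cells.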
Regardless of the task type, each task is going to need its own return stack if it wants to be able to PAUSE in something like a DO/LOOP. It's really too bad that the 65C02 return stack is fixed in location, as that makes things more difficult.
Argh, I hadn't even thought about the Return Stack that much. I should never have touched the 65816, things were so much easier there ... the problem is, how do you know if a task is "light" or "full" ("heavy"?) if it is based on a user-defined word?
Maybe one solution would be to say, you get four tasks, and then we could hardwire certain areas on both stacks. You could probably do a lot more with interrupts, but unless py65 decides to support that out of the box, I'd really rather not go there.
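For the return stack, "hardwiring" four tasks could mean cutting the 256-byte hardware stack page ($0100-$01FF) into equal regions and giving each task its own starting value for the stack pointer. The Python sketch below only computes those starting values; the equal split and the function name are assumptions for illustration, not a decided design.

```python
STACK_PAGE = 0x0100   # the 6502 hardware stack lives in page 1
PAGE_SIZE = 256       # bytes in the stack page


def initial_stack_pointers(num_tasks):
    """Return the initial S register value for each task.

    The stack grows downward, so task 0 starts at the top of the page
    and every further task starts one region lower.
    """
    region = PAGE_SIZE // num_tasks
    return [PAGE_SIZE - 1 - k * region for k in range(num_tasks)]


for task, sp in enumerate(initial_stack_pointers(4)):
    print(f"task {task}: S = ${sp:02X}, first byte at ${STACK_PAGE + sp:04X}")
```

For four tasks this gives $FF, $BF, $7F and $3F, so 64 bytes per task, which is 32 nested subroutine calls if nothing else is pushed. A task switch would then save the current S for the outgoing task and load the incoming task's saved value, for instance with TSX and TXS.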
This should probably wait until we have moved to a different assembler (see #252) and possibly until we have restructured the code for modular stuff.