
[RFC] Distributed DRTIO mastering and DMA

Open dhslichter opened this issue 3 years ago • 21 comments

ARTIQ Feature Request

Problem this request addresses

Many pieces of ARTIQ Sinara hardware, such as Fastino, Phaser, SU servo, and (in the future) Shuttler, require relatively large amounts of data to be streamed to them from the DRTIO master device. In some cases, such as Fastino, saturating the output sample rate on all channels requires DMA. For setups with large numbers of these boards, the need to push all this data from the DRTIO master device can become a bottleneck and eventually limits performance/scaling.

Describe the solution you'd like

I propose developing a design for distributed DRTIO mastering and distributed DMA. One notion, after some discussions with others, would be to have a single DRTIO "root" master at the root of the DRTIO tree, where the downstream nodes in the tree could be either DRTIO satellites (as they are now) or DRTIO masters. Those downstream DRTIO masters can in turn have further downstream DRTIO satellites and/or DRTIO masters, and so on recursively as desired.

Any given DRTIO master would interact directly with its subtree -- consisting of any satellites downstream of it (provided there is no other DRTIO master in the tree between that satellite and this master) and any DRTIO masters immediately downstream (but not their subtrees) -- and with the nearest upstream DRTIO master. For satellites in its subtree, the DRTIO master would behave exactly as it does under the current implementation of DRTIO; things only differ when there are downstream and/or upstream DRTIO masters. We would say a DRTIO master "owns" the satellites in its subtree (including itself), and "manages" any masters in its subtree, along with everything in the subtrees of those masters. Thus any given RTIO destination would be "owned" by a single unique master, but "managed" by zero or more other DRTIO masters upstream of the owning master. A routing table of some sort would keep track of the owner, as well as the hierarchy of managers, for every RTIO destination. This routing table would probably have to be stored on all nodes of the tree.
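
For concreteness, here is a minimal sketch of what such a routing table could look like (plain Python; all names and the exact layout are purely illustrative, nothing like this exists today):

```python
# Hypothetical routing table, stored on every node of the tree.
# Each RTIO destination maps to the single master that owns it and to the
# chain of managers above that owner, listed root-first.
routing_table = {
    # destination: {"owner": ..., "managers": [root, ..., nearest manager]}
    0: {"owner": "master_A", "managers": ["root"]},
    1: {"owner": "master_B", "managers": ["root", "master_C"]},
    2: {"owner": "master_C", "managers": ["root"]},
}
```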

There then would need to be a mechanism by which the different RTIO events in a given experiment are carried out as kernels running on the various masters that own the corresponding RTIO destinations. The idea would be to have the execution of the whole experiment kernel proceed in a distributed parallel fashion across all DRTIO masters in the tree. Each master would be loaded with, and responsible for the execution of, a kernel comprising that portion of the whole experiment that involves RTIO destinations it owns (as is currently the case in the single-master architecture). An additional level of complication comes in when interaction is required between something in a kernel running on a given master and an RTIO destination not owned by that master (for example, branching based on some input from an RTIO destination owned by a different master). In this case, each of the masters involved in such an interaction would need to communicate with the others to determine how the branch proceeds. More discussion about all of this is below.

Additional context

Here are a few more detailed ideas about implementation, some questions, and some implications that need discussion.

  1. Kernel operations not specific to a given RTIO destination. Kernels can include various statements like mathematical operations, get_dataset or set_dataset, and so forth that are not directly attributable to a particular RTIO destination. Where and how should those be executed? To what extent would one be able to/wish to trace back dependencies in the compiler to assign where these are executed? For example, if some math operations are carried out to determine the duration of a pulse or the frequency to set a DDS to, then those math operations might potentially be carried out on the master that owns the corresponding RTIO destination. However, one can have corner cases: for example, where the math relies on inputs from multiple DRTIO masters, or where the result of the calculation is used by multiple RTIO destinations that are owned by different DRTIO masters. In such a case, one potential solution would be to have the "lowest common manager" carry out the calculations, and distribute the result down to the various masters it manages that own the relevant RTIO destinations. A calculation which is not known to the compiler to be related to resources owned by any particular DRTIO master could be assigned in one of several ways (random DRTIO master, root DRTIO master, the DRTIO master whose kernel otherwise contains the least "work" in some sense, etc). Some of these would be more painful to implement than others. For things like get_dataset or set_dataset, it may be even less clear which master should be doing the work, although the general principle of "assign it to the master that owns RTIO destinations on which this call depends" or "lowest common manager if there are multiple such owning masters" would potentially be reasonable. There is also the possibility for race conditions if some masters are setting datasets and others are getting datasets, but my general feeling here would be to just say "this is bad code practice" rather than trying to engineer solutions. Compiler warnings of multiple masters setting and getting a single dataset, for example, could help.

  2. Handling branching/interactions between masters. In this situation, one solution would be to have the "lowest common manager" arbitrate the branching. If a kernel on master B is waiting to branch based on an input owned by master A, then the lowest common manager C of A and B could have in its kernel an operation where it waits for an input to be sent up to it from A, which it then turns around and sends down to B. Alternatively, master C's kernel could be left untouched, and the DRTIO network itself could carry a message from A to B: the message propagates upstream until it reaches the lowest common manager (C), at which point the DRTIO message handler on C recognizes that the destination is one of the masters (B) that it manages, and sends the message downstream to B. If there are multiple receiving masters for a message, intermediate masters can each send a copy down to any receiving masters they manage, and a copy up if not all of the receiving masters are managed by them (a rough sketch of this forwarding logic follows this list). All of this would require the compiled kernels running on each master to know something about the destinations of certain information they create (RTIO inputs, results of calculations, etc.); I don't know how complicated this might end up being at the compiler level. While the first solution (compiler adds an explicit instruction to C's kernel to determine what should happen at the branch) may seem silly for this simple case of just A and B, one could imagine a case where multiple masters are affected by the result of the branching calculation. In this case, one could either determine the outcome of the branching calculation at the lowest common manager and propagate just the result down to the masters who need to know it, or propagate all inputs to all masters who need to know the outcome and have each of them perform the calculation independently. The former would reduce duplication of effort, but I am honestly not sure how big a deal this would be.

  3. Compiler issues. See above -- a lot of this may well require the compiler to be fairly "smart" about splitting an experiment into multiple kernels to be run in parallel on different DRTIO masters. Things like inputs or outputs from specific RTIO destinations/channels should be reasonably easy to divide up, but other tasks may be a bit trickier as described above. I don't know the extent to which this proposal would have impact on/be impacted by NAC3.

  4. Loading kernels. I can think of at least two ways to load kernels onto the various masters. One would be to send the complete kernel (consisting of all the individual kernels specific to each master) to the root master, and have it transfer all of the individual kernels to each of the leaf masters (via their managers, if appropriate) over the DRTIO communication. The other way would be to have the division of the individual kernels carried out by the compiler, and have each DRTIO master be connected to the ARTIQ master over TCP/IP, so that each DRTIO master receives its individual kernel in its comms CPU this way. I think this latter method seems superior because it is less taxing on the DRTIO communication system, and would allow for kernels to be sent down to the core devices even if they are currently running a different kernel, without competing for DRTIO resources.

  5. Core device storage. Should this live only on the root master, to be updated and queried by downstream masters as appropriate? Another possibility would be to maintain local copies on all of the masters and broadcast any changes to all of them. One could also consider defining both "local" and "global" core device storage variables (with different scopes: accessible by any master, or only by the master that sets it). Setting or accessing these variables could use the same get() and put() methods as now, but with a flag indicating local vs. global scope (a sketch follows this list). One might need to worry about race conditions when multiple masters set global core device storage variables. Again, I don't know to what extent we would want to engineer solutions to this, versus telling users not to set globals from multiple masters in such a way that race conditions could arise depending on fluctuations in DRTIO network latency.

  6. DMA. One of the use cases that I imagine might be likely would be to have DRTIO masters that basically spend their lives DMA'ing data streams to a set of Fastino cards or the like, and otherwise just wait for branching instructions to come down the tree to change to a different DMA waveform playback. I would suggest that DMA should be strictly local, in other words any given master will only record and play back DMA sequences for the RTIO destinations that it owns, and not for ones that it manages. This would be crucial for keeping the DRTIO tree as clear of data traffic as practicable, except at the "last mile" between a DRTIO master and a data-hungry RTIO destination it owns. This may also motivate having such DMA-heavy masters be placed so that they are not managing other DRTIO masters (see next point).

  7. Efficiency. This structure would mean that kernels which rely on lots of interaction between masters would not run as efficiently (time-wise) as kernels where the masters are mostly in charge of their own things. In the limiting case of "all outputs, no branching on inputs", things should be able to run quite efficiently. It would be incumbent on the user to design their DRTIO tree and choose DRTIO masters in such a way that maximizes the efficiency with which they can run their experiments. This would probably require some good documentation and tutorials to help people understand how to design their code and their DRTIO master structure.

  8. Starting off slowly. This could potentially be a rather ambitious undertaking to implement successfully. Are there simpler variants that would achieve many of the goals but with much reduced complexity? For example, if one demanded that an experiment compile in such a way that all DRTIO masters except the root master were completely self-contained (i.e. did not interact with any other masters), would this still be useful? Inputs and branching decisions would all be the province of the root master (unless they could be carried out solely within a single non-root master). Or one could just stipulate that non-root masters are "output only" -- they can stream samples to Fastino or the like, but can't have inputs or branching (or maybe only branching that is determined for them by a message from the root master). Would this be a worthwhile intermediate solution in terms of complexity to implement, while still providing useful new capability for experiments?
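
To illustrate points 1 and 2 above, here is a rough sketch of how the "lowest common manager" could be derived from the hypothetical routing table sketched earlier, and how a message would hop up and then down the tree. All names here are invented for illustration; `send_downstream` and `send_upstream` stand in for whatever the DRTIO aux protocol would actually provide:

```python
def lowest_common_manager(dest_a, dest_b, table):
    # Chains are listed root-first, so the lowest common manager is the
    # last element that the two chains share.
    chain_a = table[dest_a]["managers"] + [table[dest_a]["owner"]]
    chain_b = table[dest_b]["managers"] + [table[dest_b]["owner"]]
    lcm = None
    for a, b in zip(chain_a, chain_b):
        if a != b:
            break
        lcm = a
    return lcm

def forward(message, receivers, here, table):
    # Message handler on master 'here': deliver to receivers it owns or
    # manages, and pass a copy upstream if some receivers are outside its
    # subtree.
    managed = [r for r in receivers
               if table[r]["owner"] == here or here in table[r]["managers"]]
    for r in managed:
        send_downstream(r, message)        # hypothetical DRTIO aux call
    remaining = [r for r in receivers if r not in managed]
    if remaining:
        send_upstream(message, remaining)  # hypothetical DRTIO aux call
```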
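
And for point 5, the local/global idea could reuse today's core device cache get()/put() calls with an extra scope flag. The `scope=` argument below is hypothetical and does not exist in the current `core_cache` API; this is only meant to pin down what the user-facing code might look like:

```python
@kernel
def run(self):
    # Hypothetical scoped variants of the existing core_cache get()/put().
    self.core_cache.put("calib", [1, 2, 3], scope="local")      # this master only
    self.core_cache.put("shared_phase", [42], scope="global")   # broadcast to all masters
    phase = self.core_cache.get("shared_phase", scope="global")
```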

I know this is all very long, but it would be really helpful to get feedback from the community on this kind of design. Does this architecture seem sensible, or is there a different one that would be better? Do you see show-stoppers or other issues not discussed above? (I'm sure people will!) What modifications would you make? What questions still need answering? @sbourdeauducq @jordens @dnadlinger @hartytp @cjbe @ljstephenson @dtcallcock @philipkent @drewrisinger @lriesebos @jbqubit

dhslichter avatar May 21 '21 05:05 dhslichter

I would make things explicit and not even try to have the compiler split kernels automagically. AFAIK doing this properly is very much an open problem in computer science and existing solutions aren't very good.

I imagine something like asyncio futures could be used to send a kernel to another device down the DRTIO tree, and then wait for its completion at a later time.
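
Something like this, perhaps -- all names below are made up for illustration; it is just how the "future" flavour of the API could read from the calling kernel's side:

```python
@kernel
def run(self):
    # Hypothetical future-style dispatch: start a kernel on a downstream
    # device, keep doing local RTIO work, then wait for it to finish.
    fut = submit_subkernel("satellite3", self.prepare_waveforms, 0.5)
    self.ttl0.pulse(10*us)      # this device keeps emitting its own events meanwhile
    fut.join()                  # block until the downstream kernel completes
```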

Distributed DMA and distributed core device storage shouldn't be big issues once the other infrastructure is in place.

Also, a small technical detail: I would run the whole system from the RTIO clock (including all CPU cores), so we don't have to deal with sync vs. async FIFO issues in RTIO implementations, which would become thorny when the above gets implemented (when switching the source of RTIO events between DRTIO upstream and the local CPU or DMA). A small side benefit is a slight reduction in RTIO latency. This can be implemented in the current code already without breaking a lot of things.

sbourdeauducq avatar May 21 '21 05:05 sbourdeauducq

I would make things explicit and not even try to have the compiler split kernels automagically. AFAIK doing this properly is very much an open problem in computer science and existing solutions aren't very good.

From a syntactic standpoint, what would you propose? It seems to me that finding any lines in the experiment that involve enqueuing an RTIO event (read or write) would be pretty straightforward to assign automatically. But it's the other stuff that would need to be assigned in some explicit fashion. How could one do that without cluttering the code terribly?

I imagine something like asyncio futures could be used to send a kernel to another device down the DRTIO tree, and then wait for its completion at a later time.

I don't think that the concept of sending a kernel to a device down the DRTIO tree and waiting for it to return is really the way to achieve the kind of distributed processing I'm looking for. I think it really has to be multiple masters running their individual kernels in parallel, with occasional message passing between them as needed.

Also, a small technical detail: I would run the whole systems from the RTIO clock (including all CPU cores)

This sounds fine to me, but what about systems where the CPU core is fundamentally asynchronous (e.g. Zynq core devices)? Or as long as the FPGA fabrics of all the masters are clocked at the RTIO clock, you're happy?

dhslichter avatar May 21 '21 05:05 dhslichter

1. For things like `get_dataset` or `set_dataset`, it may be even less clear which master should be doing the work,

There's only one thing to do with RPCs: send them to the root node, since it's the one with access to the computer network.

Another idea (just brainstorming): maybe we can replace the upper layer of the DRTIO aux protocol with Ethernet, give each DRTIO node an IP address, and bridge all Ethernet ports (i.e. each FPGA would contain an Ethernet switch). Then they could RPC directly and also the switch could be implemented in gateware and always work regardless of what the software is doing. This would help with performance and also debugging. It looks a lot like White Rabbit, but without the latency overhead for high-priority RTIO packets.

sbourdeauducq avatar May 21 '21 05:05 sbourdeauducq

This sounds fine to me, but what about systems where the CPU core is fundamentally asynchronous (e.g. Zynq core devices)?

The clock domain transfer on Zynq is at the CSR level (which is why it's slow). So no problem there either.

sbourdeauducq avatar May 21 '21 05:05 sbourdeauducq

There's only one thing to do with RPCs: send them to the root node, since it's the one with access to the computer network.

What about a non-root node that could access the network? For example, a Kasli leaf node running a bunch of Fastinos would have SFP to spare for direct connection to the network. It could help with kernel loading to have things transferred in like this? Although the solution below could work (but requires implementing Ethernet in this layer -- how much yak shaving is involved there?)

Another idea (just brainstorming): maybe we can replace the upper layer of the DRTIO aux protocol with Ethernet, give each DRTIO node an IP address, and bridge all Ethernet ports (i.e. each FPGA would contain an Ethernet switch). Then they could RPC directly and also the switch could be implemented in gateware and always work regardless of what the software is doing. This would help with performance and also debugging. It looks a lot like White Rabbit, but without the latency overhead for high-priority RTIO packets.

As above -- how much yak shaving would this involve? And would its presence be competing with the real-time communications, or could it be done such that it waits gracefully for a lull in the real-time traffic before sending the Ethernet traffic? I guess I am more interested in how hard it would be to get this right, rather than whether it is possible in theory.

dhslichter avatar May 21 '21 05:05 dhslichter

I don't think that the concept of sending a kernel to a device down the DRTIO tree and waiting for it to return is really the way to achieve the kind of distributed processing I'm looking for. I think it really has to be multiple masters running their individual kernels in parallel, with occasional message passing between them as needed.

Those aren't problems specific to ARTIQ. Are there distributed computing frameworks that you like, see e.g. https://www.csm.ornl.gov/pvm/ ?

sbourdeauducq avatar May 21 '21 05:05 sbourdeauducq

And would its presence be competing with the real-time communications

No, like I said, this would be run on the aux channel (which yields with zero turnaround time to the main channel carrying RTIO packets).

What about a non-root node that could access the network? For example, a Kasli leaf node running a bunch of Fastinos would have SFP to spare for direct connection to the network.

Just tunnel it into the aux channel. More efficient hardware-wise and fewer cabling issues for users...

sbourdeauducq avatar May 21 '21 05:05 sbourdeauducq

Those aren't problems specific to ARTIQ. Are there distributed computing frameworks that you like, see e.g.https://www.csm.ornl.gov/pvm/ ?

I have not looked at these in any detail. How well things map depends on how computationally/bandwidth intensive the different processes are. For things like distributing the feeding of data-hungry Sinara cards, it might be a reasonable map to systems that farm out computationally intensive problems to other nodes and execute in parallel.

dhslichter avatar May 21 '21 06:05 dhslichter

In my previous research group we had a similar discussion of a "centralized" architecture (the current design of ARTIQ) vs. a "distributed" architecture. That discussion was also held in the context of RTIO performance, specifically event throughput. In the distributed architecture, multiple independent controllers execute independent binaries, and the controllers communicate to synchronize and potentially broadcast data. I think that is the solution you are describing. At the time (5 years ago or so), most colleagues thought that a compiler converting a single program into multiple binaries for an arbitrary distributed architecture was too complex, and that manually programming the different binaries and their synchronization would be difficult, error-prone, and hard to debug. Maybe times have changed; I am not deeply involved in the compiler world.

The distributed architecture could be simplified by moving to more of an accelerator model, where there is still a central controller and additional controllers behave as accelerators that execute independent binaries on demand. Accelerators and the central controller are allowed to run in parallel. That potentially adds a bunch of constraints to the accelerated sub-kernels (e.g. no communication to the host and only a subset of devices available), but would be much easier to program or compile. I could see that happening in the existing ARTIQ programming paradigm.

In the end, my previous group decided at the time to keep the fully centralized architecture and to change the communication and coding between the central CPU and the RTIO system to increase event throughput, as described in this paper. So maybe an approach like that could increase (D)RTIO performance without changing the whole architecture.

lriesebos avatar May 21 '21 09:05 lriesebos

The distributed architecture could be simplified by going more to an accelerator model where there is still a central controller and additional controllers behave as accelerators that execute independent binaries on demand.

Exactly, that's what I meant by the "asyncio future to send kernel" model. Maybe we can add some message passing features as well, so data can be transferred without waiting for subkernel termination. But the messages would have to be explicitly sent and received by user code.
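
In user code that could read roughly like the following. The names (`submit_subkernel`, `send_to_parent`, `recv_from_subkernel`, and the helper methods) are purely hypothetical; the point is only that every transfer is an explicit send paired with an explicit receive:

```python
@kernel
def sub(self):
    result = self.measure()         # whatever the sub-master measures locally
    send_to_parent(result)          # explicit send; the subkernel keeps running
    self.playback_rest()

@kernel
def run(self):
    fut = submit_subkernel("sat1", self.sub)
    result = recv_from_subkernel(fut)   # explicit receive, before join()
    # ... act on result while the subkernel is still running ...
    fut.join()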

That potentially adds a bunch of constraints to the accelerated sub-kernels (e.g. no communication to host and only a subset of devices available)

IMO the only thing that's practical is that a (sub)kernel takes exclusive control of the full DRTIO device subtree below it. And sub-subkernels are allowed.

sbourdeauducq avatar May 21 '21 09:05 sbourdeauducq

@sbourdeauducq just wondering, do you think any significant (D)RTIO throughput gains could be achieved by redesigning the CPU-(D)RTIO communication system and/or coding?

lriesebos avatar May 21 '21 10:05 lriesebos

On Zynq, maybe by using posted writes (and a counter of available buffer space which is kept near the CPU core) and making RTIO underflows asynchronous (which may annoy people). Anyway that's outside the topic of this issue.

sbourdeauducq avatar May 21 '21 10:05 sbourdeauducq

This is a very useful discussion. The use case I am envisioning is an experimental setup where a bunch of cards are doing data-intensive but low/zero-connectivity tasks -- things like running SU servo loops, or streaming pre-recorded samples to Fastinos over DMA. I think that we could achieve a bunch of improvements in this regard using the more limited scheme described above. We also need to account for the fact that current hardware doesn't allow for highly branched trees, so it may be important to separate logical sub-master topology from physical sub-master topology. To summarize what this all might look like:

  • Single root master is the sole central controller.
  • Sub-masters can run (sub)kernels, which take exclusive control of their full sub-tree
  • Sub-tree is defined as the sub-master itself plus the DRTIO satellites below it; any sub-masters which are physically downstream of other sub-masters in the DRTIO tree are not considered part of the upstream sub-master's sub-tree
  • Root master is solely responsible for telling sub-masters to run sub-kernels, including which sub-kernel and starting at what time.
  • Sub-kernels can run in parallel with kernel on the root master
  • Message passing is available between sub-masters and root master through explicit user code only
  • Sub-kernel return value is just information on any RTIO errors that occurred; any user data is returned through explicit code only.
  • Sub-masters have local core device storage and local DMA recording and playback, all of which persist across kernels but are reset at power-up. References to these resources in the code must be explicit that they are located on the sub-master.
  • Sub-masters can have multiple sub-kernels cached on board (needs discussion of sub-kernel persistence in cache) such that more than one sub-kernel is available to be called as appropriate.
  • All masters run on the same RTIO clock
  • Routing table of sub-masters, including full description of owned resources, available to compiler as well as root master

Still open for discussion:

  • How does compilation work? Writing sub-kernels explicitly as separate code makes the compilation problem straightforward, but suffers from the fact that one may desire them to inherit various timing characteristics from the main experiment (for example, interleaving some pulses on the sub-master with pulses on the root master). Writing such code in two kernels and keeping it harmonized would be very bug-prone. Here is an alternative proposal, for comment:
    • Option: sub-kernels are written as explicit separate kernels and compiled as such. Easiest to implement, most likely (we think!) to cause buggy heartache.
    • Desired case: All code is written in a single kernel and compiled as though it were a single kernel.
    • After compilation, all instructions that touch (enqueue or dequeue events in) RTIO destinations owned by a particular sub-master are put into a separate sub-kernel for that sub-master. If the compiler works by unrolling all loops (perhaps we demand this for kernels involving sub-masters?), then this should produce a single linear series of instructions (with potential branches) with well-defined execution times and sequence (which may depend on variables not known at compile time).
    • Any instructions that do not explicitly touch RTIO destinations owned by a sub-master are part of the root master kernel.
    • Any variables whose values are not known and substituted in at compile time, and on which sub-kernel instructions (including timing) depend, must be passed to that sub-kernel from the root master by explicit code during the kernel. Failure to do so will raise a compiler error. This will probably cause some pain.
    • What happens if the main kernel branches? It would probably be important to allow sub-kernels to have different branches that can be run. And one doesn't want sub-kernels necessarily running off mindlessly.
      • My suggestion: sub-kernels are not allowed to persist across root kernel branching instructions. If a sub-master has tasks to perform with its owned RTIO both before and after a root kernel branching instruction, these tasks will be divided into two sub-kernels, one for before and one for after the branch.
      • By the time the root master reaches a branch, all running sub-kernels will have returned; the root master does not proceed with execution if there are still unreturned sub-kernels. After the root master branches, it calls for the execution of the appropriate sub-kernel(s) (which may depend on the results of the branch) on each sub-master (a rough sketch of this flow follows this list). One would use seamless handover on the sub-masters to ensure that there are no dead times as a result.
      • If there are different tasks for the sub-master to perform if the root master branches, then there may be more than one sub-kernel for after the branch. All sub-kernels would need to be pre-loaded onto the sub-masters, and the choice of which one to execute done by a message from the core device. This kind of structure would enable the code run on the sub-masters to be branch-dependent (which I think is important) without the sub-masters having any responsibility over branching. It should be possible for the compiler to figure this out automatically from the compiled overall kernel, I think.
      • Need to discuss persistence of cached sub-kernels on sub-masters. I propose a default of persistence only for the life of the root kernel from which they were derived. I also propose that explicit sub-kernels written as independent experiments have the option of persisting across root master kernels (based on a flag when they are compiled and loaded into the sub-master).
  • Are sub-kernels distributed to sub-masters before the start of execution of the root kernel? I think yes.
  • Sub-kernel distribution is via the DRTIO aux link (potentially running Ethernet? In that case, sub-kernel distribution would come from the ARTIQ master, and not via the root DRTIO master)
  • Sub-masters at multiple levels - I suggest that it be possible to have a sub-master downstream in the physical DRTIO tree from another sub-master. However, from a logical standpoint, I suggest that the upstream sub-master does not claim ownership or management of any of the downstream sub-master's RTIO destinations, and that the upstream sub-master is transparent (just a router) for messages passed between the root master and the downstream sub-master. So the physical topology is a tree, but the logical topology is a star, with a single root master that talks directly with each sub-master.
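
As a sketch of the branching scheme above (hypothetical names throughout -- `trigger_subkernel`, `join_subkernels`, the sub-kernel labels, and the readout helper are all invented for illustration):

```python
@kernel
def run(self):
    # Sub-kernels were split at the branch and preloaded on submaster1.
    trigger_subkernel("submaster1", "before_branch")
    join_subkernels()                    # nothing proceeds past the branch point
    if self.readout() > self.threshold:  # branch decided on the root master
        trigger_subkernel("submaster1", "branch_a")
    else:
        trigger_subkernel("submaster1", "branch_b")
```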

dhslichter avatar May 21 '21 21:05 dhslichter

@dhslichter I think you have an overly simplistic view of the compiler and the complexity of the programs getting compiled.

If all you're doing in subkernels is static and linear sequences of instructions, then you can use distributed DMA. Splitting a DMA sequence generated from one single piece of kernel code into a number of chunks to be played back on distributed devices is doable.

But automatically splitting a whole algorithm to make it distributed is extremely hard. Look at modern machines: even though they are highly parallel (look e.g. at the newest AMD Threadripper processor with 32 cores and 64 threads), exploiting this parallelism is very much left to the programmer in every practical programming framework out there (e.g. Rust, OpenMP, PVM, MPI). They don't have any implicit parallelism.

Even if we managed to make it work somehow, it would be a black box for the user and when there are performance issues or bugs, it would be very difficult to figure things out. Even type inference in the current compiler is causing a lot of problems and we want to simplify it and do it better in NAC3. Automatic parallelism is a lot more difficult than type inference.

sbourdeauducq avatar May 22 '21 02:05 sbourdeauducq

@dhslichter I think you have an overly simplistic view of the compiler and the complexity of the programs getting compiled.

I'm sure you're right. I guess I am thinking mostly about a fairly restricted set of kernels of the sorts that we often run in some experiments, rather than full generality, where I understand that yes, the compiler would be very hard pressed to cover all the potential cases automatically.

I think that a solution where one has to write explicit sub-kernels for the sub-masters would still be better than a situation with no possibility for sub-mastering. If those explicit sub-kernels can take explicit arguments from the root kernel (or just the run() method) that control their timing, then it should be possible to reduce the chance of timing not being properly aligned between root-kernel pulses and sub-kernel pulses. The whole point of the sub-kernels is to be in charge of streaming out large amounts of data, probably via DMA, to their owned hardware; devices that don't need these large amounts of data can just be owned by the root master instead of a sub-master. So the sub-kernels would consist of DMA recording and playback, or just regular kernel execution with lots of pulses, where one defines some of the timings in the kernel in terms of arguments the kernel is passed. It definitely can break if you forget to change the root kernel and sub-kernels in concert when adjusting the pulse sequence, but it's still better than nothing.
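
For instance, an explicitly-written sub-kernel could take all of its timing from arguments supplied by the root kernel, something like the sketch below. The device names and the `run_on()` dispatch call are invented for illustration:

```python
@kernel
def fastino_burst(self, t_start_mu, n_samples, dt_mu):
    # All timing is derived from the arguments, so root-kernel and
    # sub-kernel pulses stay aligned as long as the arguments agree.
    at_mu(t_start_mu)
    for i in range(n_samples):
        self.fastino0.set_dac(0, 0.001 * i)
        delay_mu(dt_mu)

@kernel
def run(self):
    t0 = now_mu() + 100000
    run_on("submaster1", self.fastino_burst, t0, 1000, 2000)  # hypothetical dispatch
    at_mu(t0)
    self.ttl0.pulse(1*us)     # root-master pulse interleaved with the burst
```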

I'm fine with having it all be required to be explicit, and people who need this performance will have to deal with the challenges. Perhaps as it goes along we can come up with some additional ideas from experience about how to make it less difficult to keep pulse sequences coordinated between root kernel and sub-kernels.

dhslichter avatar May 23 '21 17:05 dhslichter

@sbourdeauducq how practical would passing parameters (including lists, or lists of lists) to sub-kernels be? Can it appear in the code as the sub-kernel just taking arguments (which would be compiled to message passing from the root kernel to the sub-kernel), or would it be better to have some lines in the sub-kernel that explicitly fetch the desired parameters from the root kernel?

I understand that probably some of this may be impacted by the way in which NAC3 is being implemented.

dhslichter avatar May 24 '21 18:05 dhslichter

Can it appear in the code as the sub-kernel just taking arguments

Yes, no big issue here.

sbourdeauducq avatar May 24 '21 22:05 sbourdeauducq

But if you mutate a list in a subkernel, the master's copy of the list will not be updated (otherwise there are issues with performance and synchronization).

sbourdeauducq avatar May 24 '21 22:05 sbourdeauducq

In the same vein, there should be a mechanism to handle mutation of objects. A safe API would be that when a subkernel is started, ownership of a set of objects that it can mutate is transferred to the subkernel, and their data are written back to the master when the subkernel completes/is joined.
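
In user code that might look like the following (the `move=` argument and the other names here are hypothetical, just to pin down the intended semantics):

```python
data = [0] * 1024

# Ownership of 'data' is transferred when the subkernel starts...
fut = submit_subkernel("sat2", self.fill_buffer, move=[data])
# ...so touching 'data' here, before join(), would be rejected at compile time.
fut.join()          # subkernel finished; its mutations are written back
self.process(data)  # safe to use the updated buffer again
```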

sbourdeauducq avatar May 24 '21 22:05 sbourdeauducq

A safe API would be that when a subkernel is started, ownership of a set of objects that it can mutate is transferred to the subkernel

It's not easy to do though. To do it without a significant performance hit when accessing objects, it would probably require an MMU (also available on VexRiscv and mor1kx) and allocation of object groups, defined according to a pre-defined subkernel structure, into different MMU pages.

sbourdeauducq avatar May 27 '21 00:05 sbourdeauducq

May be worth looking into (if we can make it usable): http://www.cs.rpi.edu/academics/courses/fall14/proglang/materials/CTM4.pdf

sbourdeauducq avatar Aug 31 '21 04:08 sbourdeauducq