
Parallelism

Open shelby3 opened this issue 6 years ago • 79 comments

Unfortunately this discussion started in Error handling thread.

And the discussion derived from a post I made about C no longer being a low-level language because of its mismatch to the fact that all scaling must come from parallelism.

I hope that discussion will continue here instead.

How can we ever find any of our 1000s of comments of past discussion if they are buried in unrelated threads.

shelby3 avatar May 22 '18 12:05 shelby3

@keean wrote:

Actually you have that backwards, there is more information about parallelism available at runtime. This is why VLIW processors failed to outperform x86.

VLIW was based on the idea that the compiler could do more to extract parallelism from the code than the CPU, so they made a highly parallel CPU with many execution units that could run multiple instructions in parallel.

You mentioned this VLIW comparison and then VLSI/VHDL parallelism in our past discussion in the Concurrency thread. And ironically you made the same conflation of “low and high-level parallelism” which I will be correcting you on again herein.

I agree there’s more information at runtime relative to a compile-time approach which is parallelizing the same algorithm.

But your thought process is incorrectly applied to what I wrote. You conflated low-level runtime with high-level compile-time. You must compare apples to apples. Your point is true for low-level runtime versus low-level compile-time.

I did not write “more information”. I wrote “more semantic information”. The distinction apparently means nothing to you but it should.

The “more information” at runtime is limited in its character by the semantics expressed in the code. The higher level semantic information enables the programmer (not the compiler!) to make changes in the algorithm which are impossible for the low-level speculative pipelining parallelism to achieve, because the low-level transformations cannot change the high-level algorithm. A high-level algorithm designed to achieve parallelism can obtain orders-of-magnitude performance scaling; whereas the low-level speculative pipelining achieves much less than an order-of-magnitude increase, and that increase is not scaling by Moore’s law over time because the CPU designers have hit a limitation of nature which can’t scale. Advocating the futility of fighting against nature is inane, idiotic, myopic, etc..

In addition to my slamdunk point above, your point about runtime information can also be applied by a compiler+runtime at a higher semantic level which the low-level pipelining can’t do. For example, with pure functions a map operation can parallelize the operation over all elements, but the decision about whether it should or not is dependent on the ratio of the overhead of parallelization relative to the runtime cost of the operation. So the runtime could measure this ratio and decide whether to parallelize the map. A compiler could estimate it, but for cases near to the margin a runtime could measure it.
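A rough sketch of that idea, assuming Go purely for illustration (the name `parallelMap`, the sampling size, and the `spawnOverhead` constant are all invented here): the runtime times a few elements serially and only fans out across cores when the measured per-element cost outweighs an assumed per-goroutine overhead.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// parallelMap applies the pure function f to every element of xs. It first
// measures f on a small sample; if each call is cheap relative to an assumed
// spawn overhead it stays serial, otherwise it splits the rest across cores.
func parallelMap(xs []int, f func(int) int) []int {
	out := make([]int, len(xs))
	const sample = 4
	const spawnOverhead = 2 * time.Microsecond // assumed; would be tuned per platform

	n := sample
	if n > len(xs) {
		n = len(xs)
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		out[i] = f(xs[i])
	}
	var perElem time.Duration
	if n > 0 {
		perElem = time.Since(start) / time.Duration(n)
	}

	if perElem < spawnOverhead || len(xs) <= n {
		for i := n; i < len(xs); i++ { // cheap operation: finish serially
			out[i] = f(xs[i])
		}
		return out
	}

	// Expensive enough: split the remainder into one chunk per core.
	var wg sync.WaitGroup
	workers := runtime.NumCPU()
	chunk := (len(xs) - n + workers - 1) / workers
	for lo := n; lo < len(xs); lo += chunk {
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				out[i] = f(xs[i]) // disjoint index ranges, so no data race
			}
		}(lo, hi)
	}
	wg.Wait()
	return out
}

func main() {
	xs := make([]int, 1000)
	for i := range xs {
		xs[i] = i
	}
	ys := parallelMap(xs, func(x int) int { return x * x })
	fmt.Println(ys[999]) // 998001
}
```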

Another tangential point is that when you claim that it’s too difficult for programmers to extract parallelism, you are ignoring the points made in the ACM article that extant PLs lack the proper designs to extract the parallelism that they plausibly could without too much programmer intervention, such as the aforementioned pure function example. But my point has been also that ingenious programmers will find even more clever algorithms for parallelism.

That ACM article argued that C is no longer a low-level language because, since it lacks these facilities for the compiler to automatically extract the higher-level parallelism which the low-level speculative pipelining cannot extract, CPU designers are forced to create a virtual microcode pipelining hardware inside the CPU which C does not even expose to the programmer. IOW, parallelism is the only way to continue to scale Moore’s law, and C encourages CPU designers to instead not scale and waste most of the transistors on trying to squeeze diminishing incremental improvements in serial performance. That is a recipe for disaster, with the future of computing becoming stagnant and technology which relies on computing failing to improve. Thus it behooves us as PL designers to address this issue.

Essentially they stripped the speculative out-of-order hardware out of the CPU and used that space for extra execution units. This relied on the compiler to produce the parallel instruction bundles. What they found was it could only outperform the x86 on very specific scientific computation tasks, and failed to be faster for general computation.

Because they were running serial algorithms. Nowhere have I claimed that running serial algorithms on parallel hardware will be faster than running serial algorithms on speculative pipelining. Thus you shouldn’t write this irrelevant information which doesn’t even apply to my point.

My point was about the need to parallelize our performance critical code paths, otherwise our programs will not scale. This is a statement of fact, which none of your obfuscating BS has refuted.

I go with evidence over speculation every time.

Me too. I will also choose speculative parallelism because of the evidence. At a higher semantic level. Which is for example how 3-SAT is made practical for some working sets.

And you will go with evidence? Then why do you ignore the evidence that future improvements in serial speedups are dying? And why am I repeating this for the 4th or 5th time?

I simply can’t believe you’re this slow minded because you’ve demonstrated to me in many instances that you have a sharp intellect. Why the cognitive dissonance here? Is it because you lack divergent thinking? Is it some obstinance issue? Are you preoccupied? Are you being lazy and not really thinking it out and relying on past conclusions you had made instead of reanalyzing them in light of my points?

shelby3 avatar May 22 '18 15:05 shelby3

@shelby3

3-SAT is made practical for some working sets.

And 3-SAT is not general computation; it is a very specific algorithm for a class of problems, and it looks very different from general computation.

The general part of most programs is an event loop that queues sequential processes that necessarily follow one another, or are connected by streams in the pipelined case.

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I agree that a new language has to handle parallelism. There are two ways I would do this: one is vectorisation, so high-level operations like matrix multiplication can use SSE vector instructions for performance. The other is 'goroutine'-like parallel tasks communicating by streamed channels.

However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.
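For concreteness, a minimal Go sketch of that second shape, parallel sequential tasks connected by streamed channels (the `generate` and `square` stages are invented for illustration):

```go
package main

import "fmt"

// generate streams the integers 1..n into a channel from its own goroutine.
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

// square is one sequential stage of the pipeline; the stages run in parallel
// with each other, connected by channels (streams).
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // 1 4 9 16 25
	}
}
```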

keean avatar May 22 '18 16:05 keean

@keean wrote:

The general part of most programs is an event loop that queues sequential[ly ordered] processes, that necessarily follow one another, or are [analogously sequentially ordered by being] connected by streams in the [Unix shell paradigm] pipelined case.

ftfy.

Sequentially ordered != sequential process. There may be parallelism within the individual processes each of which are sequentially ordered. I know you know that nature is fractal, so there can be parallelism within serialism and vice versa.

Also the events in the event loop which are in the performance critical path are typically not sequentially ordered and can be performed in any order or even in parallel in some cases. The GUI events which are ordered are typically not performance critical (we do not need to scale the performance of GUIs by orders-of-magnitude).

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I have not looked at what are the limitations of the designs of existing GPUs. Seems that the ones built into the CPU don’t have the fast GBs of dual-ported main memory that dedicated GPU cards have? And what are the issues with exchanging work between CPU threads and GPU threads? Are the cost barriers high enough that the two processors pretty much can only send messages to each other? I know you’re reasonably knowledgeable about the extant hardware designs, so I will defer to your knowledge on this.

But yeah, maybe our compiler should be doing more to leverage that GPU (and also the CPU’s multicore) hardware which often sits idle.

I agree that a new language has to handle parallelism.

I think my recent point can be recapitulated/reconstituted as: the more parallelism we can push for in code, the more incentive CPU designers will have to prioritize high-level parallelism instead of low-level pipelining. Hopefully resulting in scaling trending back to Moore’s law progress.

My point is we should do all we can and reduce dependence on the low-level speculative pipelining in our design decisions. For example, such as where you argued in the Error Handling thread that exceptions as return values would not be an extra performance cost because of the low-level speculative pipelining. But that design decision if taken would have further cemented the need for low-level speculative pipelining and helped kill future scaling. And upon further analysis of your proposal independent of the concerns with reliance on speculative pipelining, it became clear we didn’t need to accept that design decision.

There are two ways I would do this: one is vectorisation, so high-level operations like matrix multiplication can use SSE vector instructions for performance. The other is 'goroutine'-like parallel tasks communicating by streamed channels.

Not just parallelism, but we also need to diminish our reliance on the model of main memory as having latency that is low enough to keep up with a serial processor. Because this requires expending the transistor budget on complex, deleterious caching mechanisms.

The ACM article also points out other things we can improve, for example reducing the reliance on cache coherency, which is necessary in order to enable hardware designers to prioritize parallel computation units instead of the very costly (inefficient!) model of low-latency main memory.

For example, that ACM article pointed out the gains possible if our PL allows the compiler+runtime to reformat arrays of structures into structures of arrays without intervention from the programmer (a sketch of that transformation follows the quote below). Also I wrote:


Let me quote what I wrote in my original post:

So the author of the article explained that instead of complex caches (and to accommodate the large working sets that the Transputer didn’t) we perhaps need an architecture where we have so many threads running that the latency to shared memory can be offset by computation these threads have to do in between accesses to main memory such that the CPU is always fully utilized. IOW, while some threads are waiting on main memory, other threads are running computation.

Note AFAIK that most GPU cards include a lot of shared memory with a very fast, wide bus. So perhaps they’re already close to what is optimal even for large working sets.
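As a rough illustration of the array-of-structures to structure-of-arrays reformatting mentioned above (a sketch only; the `Particle` type and its fields are invented): the SoA layout lets a loop over one field stream contiguously through memory and vectorize, which is the kind of transformation the article wants the compiler+runtime to be free to perform.

```go
package main

import "fmt"

// Array of structures: each Particle's fields are interleaved in memory, so a
// loop that only touches X also drags Y, Z and Mass through the cache.
type Particle struct {
	X, Y, Z, Mass float64
}

// Structure of arrays: each field is stored contiguously, which is friendlier
// to SIMD and to coalesced memory access.
type Particles struct {
	X, Y, Z, Mass []float64
}

// toSoA is the reformatting step a compiler+runtime could do automatically.
func toSoA(aos []Particle) Particles {
	soa := Particles{
		X:    make([]float64, len(aos)),
		Y:    make([]float64, len(aos)),
		Z:    make([]float64, len(aos)),
		Mass: make([]float64, len(aos)),
	}
	for i, p := range aos {
		soa.X[i], soa.Y[i], soa.Z[i], soa.Mass[i] = p.X, p.Y, p.Z, p.Mass
	}
	return soa
}

func main() {
	aos := []Particle{{1, 2, 3, 4}, {5, 6, 7, 8}}
	soa := toSoA(aos)
	sum := 0.0
	for _, x := range soa.X { // streams over contiguous X values only
		sum += x
	}
	fmt.Println(sum) // 6
}
```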


However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.

Not always true. Again nature is fractal.

shelby3 avatar May 23 '18 01:05 shelby3

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

And the corollary is that expending it to speed up serial algorithms that are in the performance critical path of the application is futile non-scaling.

shelby3 avatar May 23 '18 04:05 shelby3

@shelby3

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Having 4 cores does not make up for this, just look at the failure of the Power Architecture. PowerPC was IBM’s attempt to follow that approach, lots of simpler cores. It was tried in the PlayStation 2, and the added complexity of programming the thing drove developers away, leading to the dominance of the Xbox, as it provided a proper PC-like CPU with caches and super-scalar out-of-order processing.

The ideas you and that article are proposing are not new, they have been tried, and they failed the test of the free market.

keean avatar May 23 '18 05:05 keean

@keean wrote:

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Do you not understand what I intend by “are not in the performance critical path of the application”?

It can’t be slower in any relevant way, if it is not in the performance critical path.

Whoop-de-doo if you make your highly responsive GUI more responsive and nobody can discern the speedup, because it is not code in the performance critical path. Law of diminishing returns aka marginal (non-uniform) utility.

Another corollary is that only 5% (or maybe up to 20%) of your application’s code is typically in the performance critical path, although there are exceptions. Thus the less stellar programmers will still have a lot of work they’re qualified to do that can be serial algorithms.

Btw, one (or maybe 2) core in the CPU could retain that speculative pipelining and the other 8, 16, or 24 (and increasing in the future) could then not waste their transistor budget on marketing hype. Has AMD already designed a CPU like this, such as with most of the transistor budget for the GPU on the same die?

shelby3 avatar May 23 '18 10:05 shelby3

Yet another example of parallelism I think the compiler can extract without intervention from the programmer.

shelby3 avatar May 24 '18 05:05 shelby3

@shelby3

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency; fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

keean avatar May 24 '18 06:05 keean

@keean wrote:

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency; fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

I did not propose to employ parallelism in code that is not in the throughput performance critical path. Thus latency would not be impacted in the typically up to 95% of the code of a program that is not in the throughput performance critical path. Since that 95% of the code can only run serially then it can’t leverage the other cores. Thus one or two cores with the speculative pipelining is sufficient on the desktop (and servers will eventually discard the speculative pipelining entirely once we improve the software). The rest of the transistor budget can be allocated to parallelism which is the only way to scale.

You could argue that running multiple copies of the program such as for a webserver where each request runs a separate copy of the program, thus would need every core to have speculative pipelining in order to minimize latency. But I would argue that a 50% reduction in CPU-bounded latency is not worth killing the scaling (by Moore’s Law) of the number of cores and thus number of requests that can be served per CPU. This is why Xeon (which sacrifices clock rate for more cores) instead of desktop Haswell/Broadwell architecture runs on most servers. So Xeon is sacrificing latency for parallelism.

Also Ryzen is kicking ass on Intel precisely because they made the right tradeoff for where computing is headed, “Ryzen CPUs offered stronger multi-threaded performance and weaker single-threaded performance relative to comparable Intel CPUs.”

Sparc was too geared towards parallelism too soon. The software had not caught up yet. AMD is probably hitting the sweet spot for the transition phase. They decided to emphasize their strengths in design and parallelism (GPUs) and outsource the foundry (eventually leveraging the coming strength of China in foundries!). AMD is following Android’s model and Intel is following Apple’s model, so I expect AMD to end up with more market share than Intel unless they fuck up again to snatch defeat from the jaws of victory. Also AMD has rising capital to expend because of the rising demand for GPUs for mining cryptocurrencies. But eventually something like Sparc or GPUs will come back much stronger in the future when the software adapts to more parallelism. And look who continues Sparc now. Japan! The Asians are going to kick ass in the future.

Also the latency is typically not bounded by the CPU but rather bounded by the persistent storage media, e.g. SSDs. So there is sometimes (or maybe even typically?) ample overhead to allow for parallelism and discard the speculative pipelining without increasing latency, although this can vary in each situation. You could attempt to make the argument that latency of the CPU and persistent storage are additive, but smart coders know to mask the CPU latency in the persistent storage latency with interleaving. Ditto for masking CPU throughput and latency with memory latency via interleaving computation with read/write accesses.
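A rough sketch of that interleaving, assuming Go (the file names and the `process` function are invented): the next read is issued while the block already in hand is being processed, so CPU work overlaps storage latency instead of adding to it.

```go
package main

import (
	"fmt"
	"os"
)

// readAhead reads files in a separate goroutine, one block ahead of the
// consumer, so the CPU-bound processing of block N overlaps the storage
// latency of block N+1 instead of being serialized after it.
func readAhead(paths []string) <-chan []byte {
	out := make(chan []byte, 1) // one block of read-ahead buffering
	go func() {
		defer close(out)
		for _, p := range paths {
			data, err := os.ReadFile(p)
			if err != nil {
				continue // error handling elided in this sketch
			}
			out <- data
		}
	}()
	return out
}

// process stands in for the CPU-bound work on one block.
func process(block []byte) int {
	sum := 0
	for _, b := range block {
		sum += int(b)
	}
	return sum
}

func main() {
	paths := []string{"a.dat", "b.dat", "c.dat"} // hypothetical inputs
	total := 0
	for block := range readAhead(paths) {
		total += process(block) // runs while the next read is in flight
	}
	fmt.Println(total)
}
```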

shelby3 avatar May 24 '18 14:05 shelby3

I farted:

If for example that is an image filter, then split up the image file into tiles and parallelize the entire filter chain over tiles. If necessary overlap the tiles slightly if the filter input domain exceeds its output domain.
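A minimal sketch of that tiling scheme in Go (the filter here is a stand-in for the whole chain; real filters whose input domain exceeds their output domain would need the slight tile overlap mentioned above):

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"sync"
)

// filterTile stands in for the whole filter chain applied to one tile; here it
// just inverts the pixels inside the tile rectangle.
func filterTile(img *image.Gray, tile image.Rectangle) {
	for y := tile.Min.Y; y < tile.Max.Y; y++ {
		for x := tile.Min.X; x < tile.Max.X; x++ {
			v := img.GrayAt(x, y).Y
			img.SetGray(x, y, color.Gray{Y: 255 - v})
		}
	}
}

func main() {
	const w, h, tileSize = 256, 256, 64
	img := image.NewGray(image.Rect(0, 0, w, h))

	var wg sync.WaitGroup
	for ty := 0; ty < h; ty += tileSize {
		for tx := 0; tx < w; tx += tileSize {
			tile := image.Rect(tx, ty, tx+tileSize, ty+tileSize)
			wg.Add(1)
			go func(tile image.Rectangle) { // one goroutine per tile; tiles are disjoint
				defer wg.Done()
				filterTile(img, tile)
			}(tile)
		}
	}
	wg.Wait()
	fmt.Println(img.GrayAt(0, 0)) // {255}
}
```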

shelby3 avatar May 27 '18 15:05 shelby3

Some parallelism discussion in the Concurrency issues thread #17.

shelby3 avatar Jun 02 '18 06:06 shelby3

@shelby, Do you know how the RAM capacity of a normal PC board will increase in the future? A massive multi-core increase in the future is one good thing, but to run multiple instances simultaneously we also need more RAM capacity.

sighoya avatar Jun 06 '18 11:06 sighoya

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.

keean avatar Jun 06 '18 13:06 keean

@keean wrote:

It's still speculative out of order with up to 32 cores.

64 hyperthreads on a desktop! Wow.

Yeah I had realized that removing the speculative out-of-order doesn’t gain much. But the point remains that further increases due to Moore’s law are going into cores now. There’s no more that can be squeezed out of OoO execution.

Remember I wrote recently in the Concurrency issues thread #17:

So again I ask you, “Do you know what percentage of the silicon of Intel CPUs are for the speculative execution and register rewriting as contrasted with the caching?”

Apparently only about 10 – 15% of the transistor budget is expended on OoO execution (the percentage of the TDP power budget may be higher):

https://hothardware.com/news/amd-ryzen-die-shot-cache-structure-architecture-advantages-exposed

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Core_2

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Core_2

So apparently there would not be much sense in removing it for the purpose of increasing the number of cores.

Although it would still perhaps be beneficial to turn it off when we don’t need it (i.e. in I/O bound concurrency) in order to conserve power consumption (important on mobile and in data centers), and when we need our code to be surely free from timing attacks that might be lurking in the nondeterministic out-of-order speculation.

Nevertheless the point remains that the transistor budget is increasing faster than transistors can be consumed for extant core counts that would increase performance. IOW, the OoO execution and superscalar features can’t be improved. So Intel and AMD are forced to use those transistors for increased core counts.

So more extensive use of concurrency and parallelism is unavoidable.

shelby3 avatar Jun 07 '18 06:06 shelby3

@sighoya I don’t know the plans about DRAM. I presume they’ll scale everything up. It’s the latency which can’t keep pace. But with massive concurrency then we can coalesce main memory accesses as GPUs do.

@keean wrote:

But we need the L3 cache.

Well actually, one day in the distant future when the concurrency is massive, we can do coalescing of main memory (and also persistent storage) accesses with concurrency and we won’t even need L3 cache, but that is I think too far into the future to design for now.

For that sort of use C + OpenMP would be a good target.

Found this:

AMD told us you should theoretically be able to build a Threadripper system with up to 1TB of memory when 128GB LR-DIMMs are used—hopefully enough to hold you for a few years. (AMD also says Threadripper should technically be able to support up to 2TB of RAM, although the company hasn’t validated this because there are no DIMMs that support the capacity yet.)

shelby3 avatar Jun 07 '18 06:06 shelby3

@shelby3 Think of it like this: CPUs like Threadripper are best when you have threads that need to take different paths, or run different tasks. GPUs are best when all the threads run in lockstep, all taking the same codepath in parallel.

On a CPU you might launch parallel threads and then wait for them all to complete (a rendezvous); on a GPU you know all the tasks finish at the same time. So a GPU is going to be more efficient for loop parallelisation (as long as it's a big enough chunk of work to overcome the cost of launching the task and collecting the results). A CPU is going to be more efficient at having one thread per service connection dealing with client connections etc.

keean avatar Jun 07 '18 06:06 keean

@shelby3 A good parallel language should have different syntactic structures to represent these different kinds of parallelism, so the programmer is clear how things will execute and how much they will cost (the principle of visibility, and not paying for what you don't use). You could automate this with heuristics, but then the programmer has to play chase-the-heuristic, where small seemingly inconsequential code changes can tip the balance and result in unpredictable changes in performance.

keean avatar Jun 07 '18 06:06 keean

@keean made me aware of this:

Microsoft is working on a new paradigm for increased parallelism:

Now Microsoft ports Windows 10, Linux to homegrown CPU

Explicit data graph execution - Wikipedia

Also remember I predicted this earlier in this discussion about the death of speculative superscalar architecture and shared caching between threads:

OpenBSD disables Intel’s hyper-threading over CPU data leak fears

https://www.itwire.com/security/83347-openbsd-chief-fix-for-new-intel-cpu-bug.html

And remember that there is no way to disable timing information (can merely increment a variable in a loop):

https://gruss.cc/files/fantastictimers.pdf

The only 100% safe way to stop code from software timing is to make all code run in a continuously randomly changing speed of clock, which isn’t desirable nor realistic.

OS preemptive threading may to some extent foil the use of timers which rely on incrementing a variable and non-preempted execution, but the devil is in the details. I don’t know the minimum interval needed to measure a timing attack with software timers. And if the OS preempts too frequently then overhead increases, which eats into the performance that was supposed to be gained from those hardware performance features (speculative execution and shared caches) which make the timing attacks possible. Thus I’m not sure that removing hardware timers is a complete solution. Also some software needs high-precision timers. Thus I still propose that in the future we will need the option to have code request that it be run only on cores with those vulnerable performance enhancements turned off and/or to dictate which other code can run along with it in the other hyperthreads which share the same vulnerable performance enhancements. Code that prioritizes security over maximum performance or which employs suitable parallelism may want to turn those vulnerable performance enhancements off.

EDIT: if you want some flavor of why correlation attacks can be so insidious and we should not think we can so easily squash them just by removing a hardware counter.

shelby3 avatar Jun 21 '18 07:06 shelby3

@shelby3 I am not sure, I think for most people performance is more important than security. I mean you can still exploit "rowhammer" on DRAM, but people are still using DRAM :-)

keean avatar Jun 21 '18 08:06 keean

As I wrote previously, I think eventually we will end up with a way for applications to turn it off on a per core basis so that applications can decide whether they need to prioritize security (and battery life) over serial performance.

shelby3 avatar Jun 22 '18 12:06 shelby3

I want to make my stance clear because I learned in private messages that some misunderstood my stance. In my post that responded to Eric Raymond’s blog post and claim about the invariant of SICK algorithms requiring serial performance and the discussion here in Github that has ensued, I didn’t intend to imply that we will ameliorate mathematical complexity theory. Rather my stance is that we will replace the need for those algorithms by solving problems with different strategies which can be parallelized and that do not require those algorithms. My stance is that we have no choice and humans necessarily rise to the challenge and are ingenious when the market demands them to be.

We may also find algorithmic strategies to apply to existing SICK algorithms which leverage parallelism, but that wasn’t my central thesis.

shelby3 avatar Jul 09 '18 09:07 shelby3

@keean can you concur with or correct my thought that §4. Challenge #1: Traffic on pg. 3 of the paper Why On-Chip Cache Coherence is Here to Stay (which also appeared in a journal) contains an error in logic:

We now tackle the concerns regarding the scalability of coherence’s traffic on the on-chip interconnection network. To perform a traffic analysis, we consider for each cache miss how many bytes need to be transferred to obtain and relinquish the given block […] In a coherent system, when a core reads a block that is shared, the coherence protocol may need to forward the request, but to at most one core (and thus the traffic for each read miss is independent of the number of cores). However, when a core incurs a write miss to a block that is cached by one or more other cores, the coherence protocol generates extra messages to invalidate the block from the other cores. These invalidation messages are often used to argue for the non-scalability of cache coherence, because when all cores are sharing a block, a coherent system must send an invalidation message to all other cores. However, our analysis shows that when sharers are tracked exactly, the overall traffic per miss of cache coherence is independent of the number of cores […] Consider the access pattern in which a block is read by all cores and then written by one core. The writer core will indeed be forced to send an invalidation message to all cores, and each core will respond with an acknowledgment message (i.e., a cost of 2N messages for N sharers). However, such an expensive write operation can occur only after a read miss by each of those cores. More generally, for every write that invalidates N caches, it must have been preceded by N read misses. The traffic cost of a read miss is independent of the number of cores

Although they’re correct that the bus traffic overhead of the invalidation messages is constant relative to the traffic of the read misses, the cost of coherency for per-core private caches is that every write invalidation is by definition going to cause an additional read miss for each core that still needs access to anything (that wasn’t even invalid but was included) in the block (aka cache line) that was invalidated. IOW, the overall traffic cost for per-core caches due to write invalidation grows with the number of cores. Or more simply stated, sharing mutable data between cores that have private caches doesn’t scale.


EDIT: I received an email reply from one of the authors of that paper. He concurred but pointed out that my point is only true when there exists the “unusual situation” of “false sharing” of “unrelated data … collocated on the same cache block, leading to needless coherence traffic.” His opinion is that “good data placement, false sharing shouldn't be a big problem.” I responded that I disagree because there will be cases where the record of related fields being written to also needs to be read again, so it’s not necessarily due to false sharing and thus not necessarily unusual or uncommon. Thus I continue to believe that the discovery by mostly @keean with my follow-up insights is very important. However I will say I don’t have any statistics or functioning models to check the frequency of this issue and its quantitative impact.


I don’t necessarily disagree with the title of the paper though once writable shared data is disallowed in the system. There’s still some coherency required…

If we adopt @keean’s suggestion in the Pointers #38 issues thread that all data shared between threads must be immutable (aka read-only), then my realization is that the only time that per-core caches need to be invalidated is when an immutable data structure has been deallocated and the memory it occupied is allocated anew for some other data. Because the invalid data remaining in the cache will not be accessed by any program which respects the allocation until the same area in shared memory has been allocated anew. The allocator could optimize this further (so that other data in the same block which is still valid as cached is not prematurely invalidated) by refusing to allocate anew the memory that the deallocated immutable data structure formerly occupied until all of the surrounding memory which comprises an entire block (regardless of whether that entire block is loaded into any per-core cache) that will be invalidated is also deallocated. Thus the invalidation broadcasts (to the per-core caches) would only happen when the entire block in shared memory is no longer in use and has accepted a new allocation.

This does conflate software and hardware though. One way to implement it is to have a system call for invalidating blocks and rely on software to manage this correctly for its virtual memory pages.
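A toy software model of that deferred-reuse policy (a sketch with invented names; real code would live inside the allocator and talk to the OS/hardware invalidation mechanism): individually freed slots inside a cache-block-sized region are only recycled, and an invalidation only issued, once every slot in that block is dead and the block accepts a new allocation.

```go
package main

import "fmt"

const slotsPerBlock = 8 // e.g. 8 small immutable objects per cache-block-sized region

// block tracks allocation within one cache-block-sized region of shared memory.
type block struct {
	id       int
	nextSlot int // bump index over never-yet-used slots
	live     int // slots currently holding live immutable objects
}

type allocator struct {
	current *block
	blocks  []*block
	dead    []*block // fully-deallocated blocks awaiting reuse
}

// alloc bump-allocates; individually freed slots are never handed out early,
// so cached copies of a block stay valid until the whole block is reused.
func (a *allocator) alloc() (blockID, slot int) {
	if a.current == nil || a.current.nextSlot == slotsPerBlock {
		if n := len(a.dead); n > 0 {
			a.current = a.dead[n-1]
			a.dead = a.dead[:n-1]
			a.current.nextSlot = 0
			// Only now, when the fully-dead block accepts a new allocation,
			// is the invalidation broadcast to the per-core caches needed.
			fmt.Printf("broadcast: invalidate block %d, then reuse it\n", a.current.id)
		} else {
			a.current = &block{id: len(a.blocks)}
			a.blocks = append(a.blocks, a.current)
		}
	}
	a.current.live++
	s := a.current.nextSlot
	a.current.nextSlot++
	return a.current.id, s
}

// dealloc marks one slot dead; the block becomes reusable only when every slot
// in it is dead, so still-valid cached data is never prematurely invalidated.
func (a *allocator) dealloc(blockID int) {
	b := a.blocks[blockID]
	b.live--
	if b.live == 0 && b.nextSlot == slotsPerBlock {
		a.dead = append(a.dead, b)
	}
}

func main() {
	var a allocator
	ids := make([]int, slotsPerBlock)
	for i := range ids {
		ids[i], _ = a.alloc() // fills block 0
	}
	for _, id := range ids {
		a.dealloc(id) // no invalidation yet, even after the last one
	}
	a.alloc() // block 0 is reused: exactly one invalidation broadcast
}
```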

The conclusion section of that paper states:

First, this paper does not do detailed architectural simulations with specific benchmarks. We instead show that coherence’s per-miss traffic is independent of the miss pattern and number of cores. Although less precise than detailed simulation, our results are actually more robust as they are not limited to the specific benchmarks studied.

They actually failed to consider the holistic issue, which architectural simulations under scaling would presumably have illuminated. They considered each read miss independently without considering (and thus implicitly presuming otherwise) that write invalidation drives more read misses, thus amplifying traffic.

Second, we did not compare against multicore chips without caches or without a shared address space.

They also failed to consider the case of an Erlang-like model wherein all shared data is immutable.

P.S. I emailed the authors of that paper a link to this post.


Additionally immutable shared data can facilitate less complexity for cache-to-cache transfers as explained in 3) Cache-to-Cache Transfers of subsection B. Coherence Protocol Independent Design Considerations in §III. IMPLEMENTATION CONSIDERATIONS on pg. 7 of the research paper An Evaluation of Snoop-Based Cache Coherence Protocols:

3) Cache-to-Cache Transfers: To improve the timeliness of requests for data that is cached somewhere, it may be useful to allow another cache to supply the data rather than always going to memory. To support cache-to-cache transfers, each BIU must have a way of indicating whether or not its cache has the data and is able to supply it. This can be accomplished by means of a single wired-OR bus control line, which is asserted by each BIU if its cache is able to supply the data. The memory controller will also know that it should not supply the data if this control line is asserted. If several caches are able to supply a copy of the data, there must either be a way of selecting one of those caches (by using a priority or ownership scheme, for example) or it must be guaranteed that all caches will put the same value on the bus. In [1], it is noted that this additional complexity and the potential for slowing down the processors that have to provide the data from their cache has resulted in a limited use of cache-to-cache transfers.

Another issue with cache-to-cache transfers is whether or not one of the caches has the cache block in question in a Modified state, which can be indicated by another wired-OR bus control signal asserted by the BIU. If one of the caches has a modified copy of the block, it should not only be the cache that is chosen to supply the data, but it should also inhibit memory from supplying the data since memory has a stale copy [1][3]. As the cache is supplying the modified data to the requestor on the system bus, it is possible to write the modified value back to memory at the same time (as opposed to writing it back in a separate bus transaction). While this saves a bus transfer and improves bus performance, it has the additional complexity of having three cooperating members on the bus [1].

And cache-to-cache transfers provide power-savings as explained in subsection F. Energy Efficiency of Cache Coherence Protocols of §II. SURVEY OF EXISTING CACHE COHERENCE PROTOCOLS on pg. 5 of the same research paper. And also explained in §The case for cache coherency of the article How Cache Coherency Impacts Power, Performance.

Read-only shared data is actually faster and uses less power on existing systems which employ the MESI protocol.

shelby3 avatar Jul 23 '18 09:07 shelby3

This post contains a shocking (at least to me) discovery that Java and Go multi-threading with a global heap and tracing GC is entirely incompatible with scaling multi-core. Scaling multi-core is entirely incompatible with multi-threading that shares mutable data between caches. Additionally, tracing GC is incompatible with scaling multi-threading unless each thread (or small group of threads) has its own separately collected heap, because as I explained a collector which isn’t instantaneous can’t be shared between too many threads: it would pause too many threads and make threads dependent on the good allocation behavior of other threads. So scaling multi-core will require reference counting for immutable objects shared between caches (and the immutability will aid in probing and collecting cyclic references, while not stalling any threads during such cycle detection and collection).

The GC of objects which aren’t shared between caches must be congruent with a viable general purpose programming model. Intel’s Xeon Phi (aka Larrabee) suffered from not sufficiently addressing these requirements2. One option is Rust’s zero overhead 100% stack allocation with its lifetime and exclusive borrowing model providing safe reentrancy. Code would be grouped to be run only by hardware threads sharing the same cache (to avoid write-back and cache-to-cache transfer coherency overhead). Another option is similarly to restrict each logical grouping of code to hardware threads sharing the same cache, but employ a separate GC’ed heap for each said grouping. For example, each green thread could have its own separate tracing GC’ed heap. Note this means a M plurality of work stealing green threads per core that has N = # of hyperthreads in M:N threading. However, this latter option lacks Rust’s zero overhead abstraction compile-time provably safe thread reentrancy; thus suffering the race condition (e.g. live and deadlock) errors and runtime overhead of locking. (EDIT: but this deficiency is solved with Actors.)

Additionally for cache-to-cache transfer efficiency and so that lightweight (e.g. green) thread context switches could occur to mask latency of such transfers, then ideally the programming model would enable the thread scheduler to know via software that such a transfer is required.

Serial core performance can’t scale and many-core doesn’t scale with complex cache coherence. At least it doesn’t scale if writable shared data is employed by the software. And if all the software will be modified to not use writable shared data, then the complex cache coherence circuitry is wasted silicon. In that case perhaps a less complex or even software driven coherence could be employed instead. If all cache-to-cache transfers are only needed when passing an Erlang-style, Actor model message from a thread on one cache to a thread on another cache, then the software can tell the hardware the source and destination caches for the transfer. (Note this means a plurality of Actors per core running on top of M work stealing green threads per core that has N = # of hyperthreads in M:N threading). Remember per the prior post that cache-to-cache transfers are needed so as to avoid (and if not read-only or was newly created, then also avoid the write-back to and) the load from main memory which is expensive both in terms of latency and power consumption.

The Rigel design offers some rough estimate of the improvement in computation density (e.g. FLOPS/mm² of silicon) by removing the:

  • out-of-order speculative execution
  • deep pipelines
  • complex cache coherence
  • cache-to-cache transfers
  • SSE vectorized SIMD
  • virtual paging, virtualization, etc (that a GPU doesn’t have either)

It was modeled on a 45 nm process. It has 1024 RISC cores (with two-wide issue superscalar ILP), 15 MB total SRAM cache, operating at 1.2 Ghz with a 100W TDP. Compared to the Intel i7 Lynnfield on a 45 nm process which achieves only 4 CISC cores with 8 MB total SRAM cache operating at 2.8 Ghz with a 95W TDP.

Ignoring latency differences and considering only throughput density1, I doubt the Lynnfield in 2009 had the 180 instructions in flight simultaneously that current modern Intel processors have. For one thing, Lynnfield didn’t even have hyperthreading. Let’s presume maybe it had 40 in flight and the Rigel only 2. So unless more than 1024 ÷ (40 ÷ 2) ÷ 4 ≈ 13 RISC instructions are required to match the work achieved by every CISC instruction, the Rigel may have a higher compute density than the Lynnfield. Also, the in-flight instructions on the Intel include some wasted computation that will be discarded by the speculation.

The Rigel paper discusses some of the other variants2 of parallelism throughput density designs which have been attempted.

One of the key distinctions1 is between the GPU vectorized (aka SIMD) model which is able to mask main memory latency with a massive number of cheap context-switch threads (affording an order-of-magnitude higher memory bandwidth by trading off increased latency) and the non-vectorized (aka MIMD) general purpose programming model which prioritizes serialized low-latency and serialized performance at the expense of lower computation density and higher power consumption per unit of work completed (i.e. lower overall efficiency). The latter has an order-of-magnitude lower main memory bandwidth with lower but still high latency (c.f. also), requiring large L3 cache to bring average latency down sufficiently for the serialized performance. As explained in the Rigel paper, the former is not suitable as a fully generalized programming model.

The following is an idea for a variant of a Rigel-esque design (yet with software driven cache-to-cache transfers and cache coherence) that replaces the L3 (and L4) cache with a solution that has comparable latency by masking most of the latency of main memory reads. Lightweight green threads are a suitable paradigm for masking macro-scale latency1, but they aren’t efficiently preemptive enough to generally mask main memory latency. But main memory latency will only significantly occur (i.e. except for cache spill-over eviction of non-shared objects, which should be minimized by the optimizing programmer) on accessing reference counted shared objects (i.e. shared between caches), which probably mainly means reading those reference counted shared objects via cache-to-cache transfers (perhaps with software driven cache coherency driven by Erlang-style, Actor model message passing). The compiler is enabled to insert a green thread check-point for preemption by the memory controller (which ameliorates the lack of efficiency). Contrary to the incorrect association of green thread context switches with kernel thread/scheduler context switches, green thread context switches are low latency. At a function boundary, they cost only slightly more than the ~30 cycles latency of a jmp instruction because only the SP register has to be saved+restored (given that the other registers and flags are invalidated at the boundary). But in this case the preemption would not be at a function call boundary, so all the registers and flags have to be stored and restored, which may double the latency. The memory latency is ~100ns on CPUs (and ~400ns on the GPU), which on a 2.6 GHz CPU is 260 cycles. But this really needs hardware support that the software can query, so in case the main memory being accessed is already cached (which the software wouldn’t know) the unnecessary context switch would be avoided. The extra latency for storing and restoring the other registers and flags could sometimes be eliminated by prefetching to the cache at the start of the function containing the main memory accesses. No registers nor flags are yet in use at the start of a function.

Hardware-based cache coherency (which is required without this software directed coherency we’re contemplating) obfuscates the true costs of the hardware form of coherency.

1 Sufficient threaded parallelism can mask latency so that hardware is not idled. The GPU achieves this with a huge register file (RF) and putting 32 threads in a contiguously context switched warp, so context switching has low overhead (and also reduces the control circuitry needed per execution unit). The modern i7 hyperthreading employs dedicated registers for the hyperthreads which share execution pipeline ports and run simultaneously. Thus hyperthreads although masking some latency are also intertwined with a focus on serial performance. Green thread work stealing which can run on modern CPUs without hardware support employs compiler generated checkpointed pre-emption (not the same as full pre-emption) which could probably be made more efficient with dedicated hardware support. Green thread context-switches are reasonably efficient compared to OS threads, and especially where context switched at function boundaries, but only suitable for masking software macro-scale latency blocked operations, which does not include low-latency main memory accesses. They may be able to also mask efficiently the latency of waiting on a mutex lock to be freed without hardware assistance.

The GPU trades off latency in every facet for increased parallelism by eliminating synchronization assumptions for caches and global memory, as well as for example reducing the power consumption (also) by designing a much higher 24 cycle write latency (archived) (c.f. also) on the register file (RF) than the registers of each core on a CPU. The RF (and shared memory) is shared by all threads in the block so a thread context switch is very low overhead because unlike the CPU no registers need to be saved.

The GPU’s shared memory also has a much higher latency than L1 cache on the CPU and doesn’t serve the same function as a CPU cache. Also the shared memory is banked (where each 16 or 32 successive words are successive banks) so that if threads in the same block read different locations from the same bank on the same clock cycle then they will run serially instead of in parallel. All 32 threads of a warp must be unblocked in order for the warp to execute, but even if they are unblocked they aren’t guaranteed to execute uninterrupted. Note, given 32 threads attempting uniformly distributed random reads from 32-banked shared memory, probability math informs us that about 63% of the threads would run in parallel every cycle. Thus such a divergent, non-vectorized access pattern would require four (4) times the 24 cycle latency for each step forward of the warp (thread block) that requires all 32 threads to access the shared memory, because (0.37 ^ 4) × 32 < 1 so that less than one (1) thread is blocked for each such step (since all threads in the warp must be unblocked in order for the warp to execute). This is a reason the GPU is not suitable for general purpose programming.
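A back-of-the-envelope reconstruction of that arithmetic, assuming each of the 32 threads picks a bank independently and uniformly at random (my reading of the figures above, not a citation):

```latex
% Probability that a given bank is chosen by none of the 32 threads:
\Pr[\text{bank idle}] = \left(\tfrac{31}{32}\right)^{32} \approx e^{-1} \approx 0.37
% Expected fraction of banks hit, i.e. threads able to proceed this cycle:
1 - 0.37 \approx 0.63 \qquad (\approx 20 \text{ of } 32 \text{ threads})
% If roughly 37% of the threads remain blocked after each 24-cycle round,
% then after four rounds:
0.37^{4} \times 32 \approx 0.6 < 1
% i.e. fewer than one thread is still blocked, hence ~4 x 24 cycles per warp step.
```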

So on the GPU one needs to mask latency by either running many more than 8 threads per SM (i.e. “More concurrent threads”) and/or serially queuing up in the RF many copies of each step of an algorithm before processing the next step for each of those copies (i.e. “More independent instructions (ILP) within a thread”).

There’s no cache coherency on the GPU. Even the GPU which has L1 and L2 cache of main memory (actually L1 may optionally be configured only for caching local memory which is what the RF spills over into), requires a threadfence() for global coherency of writes to main memory. Even communication between warps requires a shuffle paradigm.

2 @keean had mentioned to me the Xeon Phi as being an example of multi-core that failed to be as performant per watt as a GPU and failed to match the modern Intel CPUs on serial performance. But that was based on Larrabee which was originally intended to compete with a GPU yet also remain more generally programmable. The Xeon Phi has hardware coherent cache and out-of-order execution! So it’s not surprising that it hasn’t really solved any major need. It isn’t optimally designed for maximum general purpose programming parallelism (c.f. the Actor proposal in the next post) nor for vector, SIMD programming. And of course isn’t optimal for serial algorithms either. Thus it excels at nothing.

shelby3 avatar Jul 24 '18 23:07 shelby3

Actors as the paradigm for parallelism

(note having very bad whole body inflammatory attack today with a severe headache, forehead on the keyboard fatigue, mental confusion, so please excuse any lapses in my elocution and logic below)

@keean and I had discussed Actors in 2016, for example in the Concurrency issues thread #17.

@keean and I were recently discussing in private messaging the read-only sharing via messages for Actors. Before that I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory in order to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long-term and higher scaling. Erlang has proven to be excellent for creating web servers for example.

@keean pointed out why not just make every UTXO record an Actor. I initially thought it would increase the overhead, but as we discussed the details we both ended up realizing there would be no overhead. I contributed the idea that even a hashmap could be modeled with Actors by making each bucket an Actor which contains for example a linked-list of values. So after deriving the bucket address from the key, the message could be sent to the Actor at that address to check for the value in the bucket or add/remove a value to/from the bucket. For example, if the pointer to the linked-list for each bucket is stored in an array and every Actor for those buckets employs the same code for processing an incoming message, then there’s no extra per-bucket storage needed for the Actor. Simply call the Actor’s function with input arguments consisting of the message pointer and the aforementioned bucket address. The objects pointed to for the message and the linked-list are of course immutable. The Actor modifies the linked-list by creating a new copy with the required modifications and then changes the pointer at the bucket address of aforementioned array.
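A minimal sketch of that bucket-per-Actor hashmap in Go (names invented; the “message” is just a call into the bucket’s handler, and the linked list is immutable, so an update swings the head pointer rather than mutating nodes). The per-bucket mutex bit discussed next is omitted here, so this demo is single-threaded; prepending with structural sharing stands in for the copy-on-modify described above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// node is an immutable singly linked list cell; updates build new cells.
type node struct {
	key   string
	value int
	next  *node
}

const numBuckets = 1024

// buckets holds one head pointer per bucket Actor. All bucket Actors share the
// same handler code, so there is no per-bucket storage beyond this pointer
// (plus, in the full design, one mutex bit each).
var buckets [numBuckets]*node

func bucketOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// put is the impure message handler of a bucket Actor: it builds a new
// immutable list head and swings the bucket's pointer to it.
func put(key string, value int) {
	b := bucketOf(key)
	buckets[b] = &node{key: key, value: value, next: buckets[b]}
}

// get is the pure message handler: it only reads the immutable list.
func get(key string) (int, bool) {
	for n := buckets[bucketOf(key)]; n != nil; n = n.next {
		if n.key == key {
			return n.value, true
		}
	}
	return 0, false
}

func main() {
	put("txid-abc", 42)
	v, ok := get("txid-abc")
	fmt.Println(v, ok) // 42 true
}
```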

Note when messages will cause the Actor to have side-effects, then the message must be guarded by a mutex lock to prevent Actor function reentrancy. Thus each Actor really should have a pure and an impure variant of its function. The impure variant blocks on the mutex lock. These mutex locks can be 1 bit for each Actor when the Actors are in an array (because these adjacent Actors allow for a bit field for these mutex lock bits). But note this mutex lock would be required anyway for mutating any shared memory data with multiple threads, so the mutex bit isn’t additional overhead due to Actors.
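A sketch of those 1-bit-per-Actor mutex locks packed into a bit field, using Go atomics purely for illustration (a real green-thread scheduler would yield on a failed acquire rather than just returning false):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// locks packs one mutex bit per Actor: 64 Actors per uint64 word, 1024 total here.
var locks [16]uint64

// tryLock attempts to set Actor i's bit; false means some thread is already
// inside that Actor's impure message handler (it is non-reentrant).
func tryLock(i int) bool {
	word, bit := i/64, uint64(1)<<(uint(i)%64)
	for {
		old := atomic.LoadUint64(&locks[word])
		if old&bit != 0 {
			return false
		}
		if atomic.CompareAndSwapUint64(&locks[word], old, old|bit) {
			return true
		}
	}
}

// unlock clears Actor i's bit when its impure handler returns.
func unlock(i int) {
	word, bit := i/64, uint64(1)<<(uint(i)%64)
	for {
		old := atomic.LoadUint64(&locks[word])
		if atomic.CompareAndSwapUint64(&locks[word], old, old&^bit) {
			return
		}
	}
}

func main() {
	fmt.Println(tryLock(5)) // true
	fmt.Println(tryLock(5)) // false: handler already held
	unlock(5)
	fmt.Println(tryLock(5)) // true again
}
```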

@keean pointed out that there’s no need to queue the messages and instead employ the mutex lock for blocking. He also pointed out that no need to store a stack pointer (SP) for each Actor because there’s no context-switching going on at the Actor abstraction and each Actor should return to the top-level of its stack when returning from its message processing function. The code sending the message (which may be inside of another Actor’s message processing function) calls the Actor’s function and thus there’s no context-switch required.

In order to simulate a message queue (so that the calling Actor can complete instead of blocking), have the called (impure) Actor queue it and return. The called Actor must have registered itself to be called periodically by a thread pool (or registered as callback such as of another Actor) to process its queue.

The other key insight I had is that these Actors can run atop of green threads. So for example when blocked on the message mutex lock, the green thread can context-switch. @keean initially had the thought that green threads don’t preempt, but I explained (c.f. footnote 1 in my prior post) they can be made to software “preempt” (aka cooperatively preempt) at key check-points where they can block and Actor message passing calls would be one of those check-points.

If the called Actor is only executable by threads running on another core (or group of cores, if more than one core per software-driven-coherency cache), then green threading can context-switch away from the calling Actor (to another Actor on the same core / cache group). Except if the return type of the called Actor is () (aka void in C) (presuming that failed delivery of a sent message raises an exception; although message delivery is not assured in the Actor model, that may not prevent an exception on failure or timeout, yet the timeout will not be definitive in that the message could still have been delivered), then the calling Actor can continue immediately (presuming delivery is guaranteed) while the sent message is queued by the green threading on the called Actor’s core / cache group. Actors shouldn’t presume any semantics based only on the timing of the return of calling (i.e. sending a message to) another Actor, because Actors are supposed to have autonomous state machines so that they can tolerate complex dominoes failures due to indeterminacy over the timing of delivery or failed delivery of a message.

Note although we’re describing the model of how Actors work locally on the same CPU, the message passing Actor model also scales across the bus and the network. For example, the UTXO records could be split up across servers in a cluster.

1 There’s no panacea paradigm which will prevent all cases of dead and live locks, i.e. the solution to the dining philosophers problem requires omniscience. Neither the Actor model, immutability, nor Rust’s exclusive mutable borrowing will eliminate all opportunities for dead or live locks. For example, the issuer of a message could be waiting on a response from the recipient of the message in order to complete some operation, yet the recipient of the message could (even if transitively) end up waiting on the issuer before it can complete its operation and reply to the issuer.


In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these always return (to the top level of their stack, thus total deallocation of all non-shared objects) and per our model shared data will always be reference counted.

So a simple model is: any object which the code takes the address of, passing the pointer in a function call or as a return value, is forced to be allocated on the bump pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump pointer heap.
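A toy model of that per-Actor bump-pointer heap, written in Go purely for illustration (in a real implementation this would be the compiler and runtime’s job, not a library): escaping objects go on the bump heap, and the whole heap is freed by resetting one index when the Actor’s message handler returns.

```go
package main

import "fmt"

// bumpHeap is the young-object heap of one Actor: allocation is a pointer
// bump, and "collection" is resetting that pointer when the handler returns.
type bumpHeap struct {
	buf []byte
	top int
}

func (h *bumpHeap) alloc(n int) []byte {
	if h.top+n > len(h.buf) {
		panic("young heap exhausted (a real runtime would spill or resize)")
	}
	p := h.buf[h.top : h.top+n]
	h.top += n
	return p
}

// reset frees every non-shared object at once; shared (reference-counted)
// objects must live elsewhere, never in this heap.
func (h *bumpHeap) reset() { h.top = 0 }

// handleMessage models an Actor's message handler: anything whose address
// escapes is allocated on the bump heap, and all of it dies on return.
func handleMessage(h *bumpHeap, msg string) {
	scratch := h.alloc(len(msg))
	copy(scratch, msg)
	fmt.Println(string(scratch))
} // caller resets the heap afterwards: total deallocation, zero tracing

func main() {
	h := &bumpHeap{buf: make([]byte, 1<<16)}
	for _, m := range []string{"ping", "pong"} {
		handleMessage(h, m)
		h.reset() // the entire young heap of this invocation is gone
	}
}
```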

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus one mustn’t store in any reference counted object a pointer that points into the stack or the bump pointer heap. And one mustn’t store in any object on the stack or the bump pointer heap a pointer into any reference counted object that can have its reference count decremented by that Actor before that Actor function returns. For example, a reference counted object whose reference is created by a function would decrement the reference count when the function returns unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.


Someone commented and I replied:

that all sounds very Erlang-ish, but I guess they are missing your type system constraints

Erlang has process-level context switches, which are very preemptive (although apparently also cooperative) but carry very heavyweight overhead. Thus it can’t eliminate the L3 cache for scaling multi-core as I proposed in the prior post. Also AFAIK Erlang doesn’t guarantee the queue-less message passing that @keean and I devised, which is really critical to lowering the overhead of Actors to zero as I explained, and also critical for implementing software cache coherency as I explained in the prior post. Also Erlang does not have the stack allocation I proposed and instead has automated generational GC, which I have pointed out is incompatible with multi-core scaling. Yes Erlang and Actors are the inspiration, but we have proposed significant innovations, and yes also the Erlang PL is lacking our ideas for a type system and other features we’d like to have for a complete PL such as modules. Also I think that with Zer0 I will likely first target Go as the compile target and VM. Although in this case the reference counted objects (as declared in the code) would actually be garbage collected by Go’s GC on the Go compile-target. The important point is that Zer0 can be compiled in the future to more performant code and will support a multi-core scalable future.

Note that by targeting Go initially with cooperative preemption (instead of the Erlang BEAM VM’s process-level preemption, which is apparently also partially cooperative), we could, as compared to Erlang (c.f. also), better integrate SIMD and other assembly-level optimizations without corrupting the runtime as NIFs can do in Erlang. We’d still need to compile cooperative preemption into the SIMD code (c.f. footnote 1 in my prior post).

The Actor-like design contemplated in this post would rectify the following complaints about Erlang:

One thing Erlang isn't really good at: processing big blocks of data.

He means things like decoding mpeg data. There is too much numerical computation which Erlang is not optimized for. If the processing just involves shuffling big blocks of data from one place to another, then Erlang is quite good at it. (Files to TCP sockets, etc.)

You can't update shared blocks of data (there are no pointers in Erlang) and hence data must be shuttled across processes which in turn translate to inefficiencies.

shelby3 avatar Jul 25 '18 05:07 shelby3

In response the Quora question What programming languages do software engineers use in 2018?, this answer is more evidence to think that the proposed Actor model in this thread is the future:

When Salesforce bought Mulesoft for $6.5 billion, every programmer should have sat up and taken notice. This event adds to other recent API-related developments, especially from IBM, that clearly signal that the new programming paradigm is based on APIs and microservices.

Websites are transitioning from standalone collections of web pages to dynamic microservices hosted in the cloud. This simply means that what used to be a website is now a collection of independent programs running in the cloud. These independent programs “talk” with each other and with other services in the cloud via APIs.

For example, the veterinarian where I take my dog has a traditional static website with all the usual information about hours, personnel, specializations, etc. There are links to other veterinary websites where I can create an account and buy dog food, medications, toys, etc.

Now, with the proliferation of APIs, this architecture is transformed: the formerly static website interfaces directly with other websites using APIs, so the partitioning among websites becomes transparent. Soon I will go to my veterinarian’s “website” and all services and products will be available in one place using a single domain name and a single customer account. There will no longer be hyperlinking among individual, standalone websites (at least insofar as the visitor can tell). This integration of the WWW is happening really, really fast.

shelby3 avatar Jul 31 '18 11:07 shelby3

This might interest you.

sighoya avatar Jul 31 '18 13:07 sighoya

@sighoya ty. Very succinct blog. Good choice. Yeah, I was thinking that what @keean and I were contemplating is sort of a mix of CSP and Actors, because we are thinking of blocking on send and guaranteeing message delivery. Yet we were also talking about addresses and not channels to make it more efficient.

That blog reminded me of an excellent point: one of the key aspects of Actors (which I had also mentioned recently and in the past to @keean) is that they don’t guarantee message delivery nor order of delivery. So the Actor model forces one to design for network failures.

So we’ll need traditional Actors as well as our proposed optimized form for optimizing local resources.

shelby3 avatar Jul 31 '18 14:07 shelby3

I wrote:

In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because Actor functions always return (to the top level of their stack, thus totally deallocating all non-shared objects) and per our model shared data will always be reference counted.

So a simple model is: any object whose address the code takes and passes as a function-call argument or return value is forced to be allocated on the bump pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump pointer heap.

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus we mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And we mustn’t store in any object on the stack or the bump pointer heap any pointer into any reference counted object that can have its reference count decremented by that Actor before that Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.

@keean brought to my attention Ada’s new research on static memory management which he claims doesn’t require “annotating all the ‘lifetimes’” as Rust requires.

This new research is layered on top of their SPARK language. SPARK is a strict subset of Ada that was designed to have unambiguous semantics and that aimed at formal verification from the start.

However, consider the limitations of SPARK:

SPARK removes some features of the Ada language such as recursion, dynamic memory allocation, access types9 [i.e. pointers], dynamic dispatching and generics. It also imposes certain restrictions on the constructs of the language (e.g. array dimensions are always defined by previously declared type(s) or subtype(s)10).

On top of the language restrictions, it adds annotations that allow for the specification of data-flow, creation of abstract functions and abstract datatypes and to the use of structured assertions (loop invariants are available but package11 invariants are not). There is also a clear distinction between procedures (which may have side-effects) and functions (which are pure/free of side-effects and also have to return a value).

So SPARK has no generics. Rust’s lifetime parameters would need to be annotated significantly less often if Rust didn’t allow type parameters.

This new research adds pointers to SPARK by enforcing strict control of aliasing.

But Ada’s access pointers are (as I discussed in the past) already severely restricted, such that they’re probably as cumbersome as, or more cumbersome than, Rust’s lifetimes. The research paper also mentioned this:

As part of its strong type checking, Ada also prevents dangling references to objects on the stack or the heap, by providing automatic compile-time checking of “accessibility” levels, which reflect the lifetimes of stack and heap objects.

Conversions between pointer types are restricted to ensure pointers never outlive the objects they designate. Values of a pointer type are by default initialized to null to prevent use of uninitialized pointers, and run-time checks verify that a null pointer is never dereferenced.

Since Ada doesn’t accurately track lifetimes, it can’t support destructors:

Standard Ada supports safe use of pointers (“access types” in Ada) via strong type checking, but safety is guaranteed only for programs where there is no explicit deallocation of pointed-to objects – explicit deallocation is considered “unchecked” programming in Ada, meaning that the programmer is responsible for ensuring that the deallocation is not performed prematurely. Ada can provide automatic reclamation of the entire memory pool associated with a particular pointer type when the pointer type goes out of scope, but it does not automatically reclaim storage prior to that point. It is possible for a user to implement abstract data types that do some amount of automatic deallocation at the object level, but this requires additional programming, and typically has certain limitations.

So although this Ada proposal reduces some lifetime annotations, it still has the same inability as Rust to allow all safe cases2, and has additional limitations which Rust doesn’t have.

And worse, the Ada proposal doesn’t even allow pointers to objects on the stack!

AFAICT, Ada is really an incredibly inflexible “What Color is Your Function” clusterfuck.

I much prefer my idea cited at the start of this post, which attains static memory safety with 100% flexibility, 100% full features, 0% tsuris, and doesn’t even require formal verification! How about that for a paradigm shift! And if we still need Rust’s lifetime analysis in some highly critical code paths, then let’s consider my idea to make the annotations less noisy. But frankly, if we’re going to add optional lifetimes for functions that we want to be extra performant (so we can prove non-aliasing to the compiler), I think we should devise stronger proving which can prove safe cases that Rust can’t2. However, in that case I really think we should just trust the programmer to write those critical performance paths correctly and to label as non-aliased (e.g. with the restrict keyword in C) those pointers which the programmer knows aren’t aliased (although generally proving non-aliasing at compile-time is a tall order, there are scenarios where it can be determined locally, e.g. locally allocated distinct objects), unless we need formal verification for some sort of absolute guarantee of correctness.
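For comparison, this is roughly what “trusting the programmer” looks like with C’s restrict (a sketch only; restrict merely asserts non-aliasing and the compiler does not verify the assertion):

```c
/* The programmer promises that dst and src never overlap, so the compiler is
   free to vectorize and reorder the loads and stores. If the promise is false
   the behavior is undefined -- nothing is proven at compile time. */
void scale_into(float *restrict dst, const float *restrict src, float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```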

EDIT: I’m not claiming the Ada research has no utility. The SPARK subset provides for formal verification. So within that SPARK language, the additional feature of pointers to objects on the heap is an improvement. I’m just arguing that it’s not the ideal paradigm for a general-purpose PL wherein productivity and ease-of-programming are a higher priority than formal verification.

shelby3 avatar Aug 03 '18 06:08 shelby3

I wrote:

Note thus we mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And we mustn’t store in any object on the stack or the bump pointer heap any pointer into any reference counted object that can have its reference count decremented by that Actor before that Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns unless that reference is returned to the caller of the function.

The only reasons I can think of why automatically reference counted (ARC) objects would be employed inside Actor code are:

  1. For references that aren’t persisted after the Actor returns. These are objects which are either sent in a message call to another Actor and/or which have destructors (i.e. that execute when the reference count reaches zero).

  2. For references that persist after the Actor returns. These are either as persistent state owned by the Actor or objects returned to the caller of the Actor.

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns. The compiler will enforce the invariant by checking that objects with this annotation can only be assigned to references which are not re-assignable (akin to const in JavaScript, not const in C++), including those returned from functions that return this annotated type. And the compiler will check that every such annotated reference which was not an input argument must be returned by that function. If the function is at the top level of the stack then no such checks are required, because the reference can’t go out of scope and decrement the reference count until the Actor returns (thus such an annotated reference may also be stored in persistent Actor data).

Such annotated types can then be freely assigned to non-ARC references. This mostly ameliorates the bifurcation problem (analogous to “What Color is Your Function?”) arising from the two kinds of references due to the segregated GC schemes for shared and non-shared objects that have been proposed in this thread.
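As a small aside on that parenthetical, the distinction between the two kinds of const can be illustrated in C (illustrative only; the `Shared` struct is a hypothetical stand-in for an annotated reference counted object):

```c
typedef struct { int refcount; int payload; } Shared;

void example(Shared *a, Shared *b) {
    /* Non-re-assignable reference (what JavaScript's const gives a binding):
       the pointer itself can't be redirected, but the object stays mutable. */
    Shared *const keep = a;
    keep->payload = 42;   /* allowed                                         */
    /* keep = b; */       /* rejected: the reference may not be re-assigned  */

    /* C++-style const-ness would instead freeze the pointee: */
    const Shared *frozen = b;
    /* frozen->payload = 7; */  /* rejected: the object is read-only here    */
    (void)frozen;
}
```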


EDIT: Unlike Pony, our non-ARC objects can’t be shared. They must be copied to a (said annotated) ARC object in order to be shared with another Actor (i.e. sent in a message). We could also adapt some of Pony’s principles on sharing objects, in that Pony allows iso and trn objects which can be locally mutated before being shared between Actors/threads. ~~That may break the software cache coherency I invented up-thread for massively scaling multicore. So we probably don’t want to copy that flexibility in Pony which actually makes it more abstruse.~~ See the chart and discussion I wrote in the other thread.
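A hedged C sketch of that copy-before-share rule (everything here — `SharedMsg`, `actor_send`, the stubbed runtime behavior — is a hypothetical illustration of the idea, not the proposed implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Only reference counted objects of this kind ever cross Actor boundaries, so
   no Actor can observe another Actor's mutations of its local (non-ARC) data. */
typedef struct {
    int  refcount;
    char body[64];
} SharedMsg;

/* Stub for illustration: a real runtime would enqueue the message for the
   target Actor; here we just drop the sender's reference. */
static void actor_send(int actor_id, SharedMsg *msg) {
    (void)actor_id;
    if (--msg->refcount == 0) free(msg);
}

static void share_result(int actor_id, const char *local_buf /* stack or bump heap */) {
    /* Copy the non-shared local data into a fresh reference counted object;
       the local original never escapes the sending Actor. */
    SharedMsg *m = malloc(sizeof *m);
    if (!m) return;
    m->refcount = 1;
    strncpy(m->body, local_buf, sizeof m->body - 1);
    m->body[sizeof m->body - 1] = '\0';
    actor_send(actor_id, m);  /* the receiver (here, the stub) takes over the reference */
}
```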

Data race protection (which should not be conflated with protection against all forms of semantic deadlock and livelock faults) in both Pony and my idea is due to the fact that the Actor is not reentrant (i.e. runs only single-threaded), each Actor runs only one thread at a time, and more than one Actor can never write to the same shared object (not just not simultaneously, but not ever, except for the exclusive local writes of iso and trn to non-shared objects before sharing them). Pony also has exception safety but in a non-ideal purist form which Zer0 will improve.

shelby3 avatar Aug 16 '18 10:08 shelby3

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns.

Why create an extra type for it? A type only says something about how its values are stored.

Rather, I would use keywords like const and mutable to restrict access through some pointers.

sighoya avatar Aug 16 '18 11:08 sighoya