
Question about the hit-miss unit in the last-level cache

bufans opened this issue • 4 comments

Does this llc_hit_miss unit use one simple state named busy_d to control the tag store read? I think the throughput of hit_miss_unit cannot be one request/descriptor per clock cycle. Why not use a pipelined scheme?

bufans • May 30 '21, 09:05

Hi @bufans, thanks for your questions.

Does this llc_hit_miss unit use one simple state named busy_d to control the tag store read?

The busy_q flip-flop (busy_d is its next-state input) indicates whether the hit-miss unit currently holds a valid descriptor.

I think the throughput of hit_miss_unit cannot be one request/descriptor per clock cycle. Why not use a pipelined scheme?

The hit-miss unit accepts descriptors on the desc_i input, which is handshaked with the valid_i and ready_o signals. In both states of busy_q, the hit-miss unit can accept a descriptor (raising ready_o) in the same cycle in which a new descriptor is applied (with valid_i). The relevant lines are 265 for busy_q == 1 and 312 for busy_q == 0. Through these handshakes, the hit-miss unit can accept one descriptor per clock cycle.
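In pseudo-SystemVerilog, the pattern looks roughly like this. This is a simplified sketch, not the actual axi_llc code: desc_q/desc_d hold the stored descriptor, and the downstream valid_o/ready_i handshake is an assumption of the sketch.

```systemverilog
// Simplified sketch of the descriptor handshake (not the actual axi_llc
// code); desc_t is a placeholder for the descriptor struct.
module hit_miss_handshake_sketch #(
  parameter type desc_t = logic [7:0]
) (
  input  logic  clk_i, rst_ni,
  input  desc_t desc_i,
  input  logic  valid_i,
  output logic  ready_o,
  output desc_t desc_o,
  output logic  valid_o,
  input  logic  ready_i
);
  logic  busy_d, busy_q;
  desc_t desc_d, desc_q;

  assign valid_o = busy_q;   // a held descriptor is presented downstream
  assign desc_o  = desc_q;

  always_comb begin
    busy_d  = busy_q;
    desc_d  = desc_q;
    ready_o = 1'b0;
    if (busy_q) begin
      if (ready_i) begin     // held descriptor leaves downstream
        busy_d  = 1'b0;
        ready_o = 1'b1;      // accept a replacement in the same cycle
        if (valid_i) begin
          desc_d = desc_i;
          busy_d = 1'b1;
        end
      end
    end else begin
      ready_o = 1'b1;        // idle: always ready for a new descriptor
      if (valid_i) begin
        desc_d = desc_i;
        busy_d = 1'b1;
      end
    end
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      busy_q <= 1'b0;
      desc_q <= '0;
    end else begin
      busy_q <= busy_d;
      desc_q <= desc_d;
    end
  end
endmodule
```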

Does that address your concern, or did you mean something else? If not, could you please elaborate on how you would pipeline the descriptor handling?

@WRoenninger might be able to complement my explanations.

andreaskurth • Jun 02 '21, 12:06

@accuminium, thanks for your answer. I see what you mean now. But for SPM requests, I did not see where ready_o gets set. Can the hit-miss unit accept one SPM request per clock cycle?

Another question: I think ready_o depends only on whether the tag_store can accept an access request, right? Could the control logic for tag_store accesses be split from the control logic for the tag_store response stage?

bufans • Jun 03 '21, 03:06

Hi @bufans,

It is as @accuminium described. Non-SPM descriptors potentially stop in llc_hit_miss for the lookup and can be waiting for the response of the tag_store. Think of the descriptor making a 'pit stop', like in racing. hit_miss is able to handle one descriptor per cycle under the conditions that the descriptor hits, does not need to update the tag inside the tag_store, and does not collide with other descriptors further down the pipeline (see the sketch below). hit_miss makes sure that exactly one descriptor per cache line and way is active further down the pipeline until it has finished its data handling.
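In condensed form, the full-throughput condition could be sketched as follows; all identifiers here are illustrative, this is not the actual axi_llc code:

```systemverilog
// Hypothetical summary of when a descriptor can pass through hit_miss
// without stalling; all names here are illustrative.
logic tag_hit, tag_update_needed, line_in_flight, pass_through;

assign pass_through = tag_hit             // lookup hit, no eviction needed
                   && !tag_update_needed  // the single tag_store port stays free
                   && !line_in_flight;    // no other descriptor is active on
                                          // this cache line and way downstream
```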

The lookup request to the tag_store for cache descriptors only happens when a descriptor enters the unit, and consequently ready_o == 1. Updates to the tag_store are done when a descriptor leaves the unit. While the tag_store handles the update, no new descriptor is loaded (ready_o == 0); busy_q == 0 in the next cycle, where a new descriptor can then be loaded. This is mainly because the tag_store is modeled with a single-port RAM, and to prevent reordering issues with tag updates. This scheme would also allow for RAMs that have more than one cycle of delay. On misses there can also be some delay from the evict response calculation, see the evict_box.

SPM descriptors do not make requests to the tag_store and are simply fed through desc_q. This is done to reuse resources and has its origin in the idea that the SPM functionality comes basically 'for free' in this architecture. SPM also needs some of the logic in hit_miss to prevent reordering with respect to the AXI protocol (ID ordering), as hit_miss is also responsible for routing a descriptor to the hit bypass or the evict pipeline. Currently, when an SPM descriptor leaves, the unit does not load a new descriptor right away and goes to idle.
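To illustrate the access scheme on the single tag_store port, here is a simplified fragment; desc_leaves, tag_needs_update, and desc_is_spm are hypothetical names, and the busy_q handshake from above is abstracted away:

```systemverilog
// Fragment: arbitration of the single tag_store port (hypothetical names).
// Lookup happens on entry, update on exit; an update occupies the one port,
// so no new descriptor is accepted in that cycle (ready_o == 0).
always_comb begin
  tag_req = 1'b0;
  tag_we  = 1'b0;
  ready_o = 1'b0;
  if (desc_leaves && tag_needs_update) begin
    // Update on exit blocks the port for this cycle.
    tag_req = 1'b1;
    tag_we  = 1'b1;
  end else begin
    ready_o = 1'b1;
    if (valid_i && !desc_is_spm) begin
      // Lookup on entry for cache (non-SPM) descriptors.
      tag_req = 1'b1;
    end
  end
end
```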

One thing to consider: the LLC typically sits behind a higher-level cache and was written with the expectation that the usual access pattern is bursty (bursts are also expected onto SPM regions). Each descriptor describes the actions to take on a whole cache line; think of it as a sub-burst of the whole burst. This means that hit_miss does not need to process one descriptor per cycle, as the limiting factor will be the data accesses further down the pipeline and on the AXI write and read data channels. Only if the requests are short, e.g. simultaneous read and write requests with ax.len < 2, do I see a potential throughput issue.
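To make that concrete with a hypothetical configuration: with a 64-byte cache line and a 64-bit AXI data width, a descriptor covering a full line generates 8 data beats, so the data pipeline stays busy for roughly 8 cycles per descriptor and hit_miss only needs to deliver one descriptor every 8 cycles. With ax.len < 2 (1 or 2 beats per burst), each descriptor covers only 1 or 2 beats, and hit_miss would need to sustain close to one descriptor per cycle, which the stall cycles for tag updates and the SPM idle cycle can prevent.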

I hope this clarifies things. Feel free to ask if you have further questions.

WRoenninger • Jun 04 '21, 16:06

@WRoenninger Thank you for such a great reply.

I didn't realize the tag_store was a single-port SRAM; that explains everything.
If a dual-port SRAM were used for the tag_store, maybe a forwarding path could solve the read-after-write problem?
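Something like this rough sketch is what I have in mind (hypothetical names; assuming a dual-port SRAM with one cycle of read latency):

```systemverilog
// Rough sketch of the forwarding idea (hypothetical names). When a read and
// a write hit the same tag index in the same cycle, the write data is
// forwarded to the read response instead of the stale RAM output.
logic fwd_q;     // forward-select, delayed to line up with the RAM latency
tag_t wdata_q;   // captured write data (tag_t is a placeholder type)

always_ff @(posedge clk_i or negedge rst_ni) begin
  if (!rst_ni) begin
    fwd_q   <= 1'b0;
    wdata_q <= '0;
  end else begin
    fwd_q   <= rd_req && wr_req && (rd_index == wr_index);
    wdata_q <= wr_data;
  end
end

// Mux the captured write data in front of the RAM read data.
assign rd_data = fwd_q ? wdata_q : ram_rd_data;
```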

Regarding the throughput requirement for an AXI last-level cache, I agree that the required data throughput is higher than the required request throughput.

bufans • Jun 06 '21, 03:06