
Regression Bug in AXI-Stream slave when upgrading from v4.0.8

Open FranzForstmayr opened this issue 6 years ago • 37 comments

Hi, I hit a regression bug when upgrading VUnit from v4.0.8 to v4.2.0. With git bisect I found the offending commit 3e098f567af0a442ae9f3a41b0fcb90f2092daed, merged in #429 .

When using the axi-stream slave, the ready signal starts toggling. image

The same issue exists in the axi_stream_slave; the tests are passing, but take approximately twice as long. The offending line is https://github.com/VUnit/vunit/blob/master/vunit/vhdl/verification_components/src/axi_stream_slave.vhd#L95, however I haven't dug deep enough to find out why this change was made.

FranzForstmayr avatar Oct 23 '19 12:10 FranzForstmayr

FTR, other changes to AXI-Stream a few days before #429 introduced another partial regression, as explained in #494. Does GitHub make it easy to compare v4.0.8 and v4.2.0 of subdir https://github.com/VUnit/vunit/tree/master/vunit/vhdl/verification_components/src only?

umarcor avatar Oct 23 '19 12:10 umarcor

We hit something similar some time ago. You are probably using the pop procedure with direct response variables, or the check procedure with blocking => true. In that case, the slave takes one message, handles it and goes back to idle. Going back to idle causes ready to go low for one cycle. You either have to use pop with references, or check with blocking => false.
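For illustration, the reference-based flow looks roughly like this. This is a sketch only: it assumes the pop_axi_stream reference overload and await_pop_axi_stream_reply from VUnit's axi_stream_pkg, and the variable names and data width are mine:

```vhdl
-- Sketch of the reference-based (non-blocking) pop flow, inside a test process.
variable ref  : axi_stream_reference_t;
variable data : std_logic_vector(7 downto 0);
begin
  -- Queue the request: returns immediately, without consuming simulation time.
  pop_axi_stream(net, slave_axi_stream, ref);
  -- More requests could be queued here to keep the slave busy back-to-back.
  -- Block only when the data is actually needed:
  await_pop_axi_stream_reply(net, ref, data);
```

Because the request is already queued inside the VC when the clock edge arrives, ready does not need to drop between transfers.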

That was different, probably before the merge you mentioned. But the process structure in the slave was changed.

eschmidscs avatar Oct 23 '19 12:10 eschmidscs

I created a synthetic example in https://github.com/eine/vunit/tree/axis/synthetic/examples/vhdl/array_axis_vcs/src/test, in an attempt to better understand this issue. According to previous comments here and in #673, the blocking behaviour is produced by pop_axi_stream in https://github.com/eine/vunit/blob/axis/synthetic/examples/vhdl/array_axis_vcs/src/test/tb_axis_loop.vhd#L113 (that is, by the API of the AXI Stream Slave verification component).

In #673, @FranzForstmayr proposed using a queue (https://github.com/VUnit/vunit/pull/673/files#diff-f2d4d6a3240eda80ddc8a14823d070f532d73aa493d7680a449d846445014477R113-R121), which implies using pop_axi_stream with references and an intermediate queue. That was suggested by Emanual above. Hence, Franz's solution does not really fix the issue, but works around it by introducing a "throughput" adapter. I.e., the queue is effectively a FIFO.

On the one hand, I am concerned about the apparent inconsistency between push_axi_stream and pop_axi_stream: the former does not introduce a cycle, while the latter does. I think it is acceptable for both of them to introduce a cycle (should there be technical constraints that require it), but the inconsistency is confusing. @olafvandenberg, since you implemented #429, do you mind providing some insight?

On the other hand, new_axi_stream_slave and new_axi_stream_master accept actor as an optional argument. This can be used for embedding queues into the verification components:

  constant master_axi_stream : axi_stream_master_t := new_axi_stream_master (data_length => data_width, actor => new_actor(inbox_size => 64, outbox_size => 64));
  constant slave_axi_stream  : axi_stream_slave_t  := new_axi_stream_slave  (data_length => data_width, actor => new_actor(inbox_size => 64, outbox_size => 64));

Therefore, needing an external queue in order to use pop_axi_stream by reference feels non-idiomatic to me. If a queue is necessary, it should be internal to the VC.

/cc @LarsAsplund

eine avatar Nov 09 '20 18:11 eine

I'm not sure I fully understand the issue. The pop_axi_stream with reference can be fully queued up, without any cycles in between. When using the pop with a direct await of the reply, there is indeed a low cycle, so back-to-back transfers don't work in that case, I guess. Is that the issue?

The changes (if I remember correctly) were made at that time to align to the clock input of the VC. Before the changes, the VC could become asynchronous when using a wait-for-time message, I think. I use wait-for-time a lot to generate randomized / bursty data traffic.

The queue being outside of the VC was already the case in the original version, I think. It follows the same style as the stream slave VC. If the different style (passing an actor with queue sizes) is going to be used, I guess it should be adapted in every VC that uses this style.

olafvandenberg avatar Nov 10 '20 10:11 olafvandenberg

I'm not sure I fully understand the issue. The pop_axi_stream with reference can be fully queued up, without any cycles in between. When using the pop with a direct await of the reply, there is indeed a low cycle, so back-to-back transfers don't work in that case, I guess. Is that the issue?

The issue is the different behaviour of pop_axi_stream, depending on the overload:

  • https://github.com/VUnit/vunit/blob/710f68d51f1c2a82db2b6cf589b3e28a1f3194f0/vunit/vhdl/verification_components/src/axi_stream_pkg.vhd#L256-L261 forces a low cycle.
  • https://github.com/VUnit/vunit/blob/710f68d51f1c2a82db2b6cf589b3e28a1f3194f0/vunit/vhdl/verification_components/src/axi_stream_pkg.vhd#L264-L266 does not force a low cycle.

My understanding is that the two await_pop_axi_stream_reply variants are the ones that should introduce a low cycle:

  • https://github.com/VUnit/vunit/blob/710f68d51f1c2a82db2b6cf589b3e28a1f3194f0/vunit/vhdl/verification_components/src/axi_stream_pkg.vhd#L269-L279
  • https://github.com/VUnit/vunit/blob/710f68d51f1c2a82db2b6cf589b3e28a1f3194f0/vunit/vhdl/verification_components/src/axi_stream_pkg.vhd#L281-L286

But the three pop_axi_stream overloads should behave identically, and not introduce a low cycle.

So, the question is whether the different behaviour of pop_axi_stream corresponds to some limitation of how you implemented the stall feature, or whether it was an unwanted side effect.

eine avatar Nov 11 '20 00:11 eine

OK, I get it. It is an unwanted side effect of the change needed for the clock re-alignment. I'll have a look today to see if I can think of a proper fix for it.

olafvandenberg avatar Nov 11 '20 08:11 olafvandenberg

So I had a closer look, and I can't seem to reproduce this behavior. There is a small glitch on ready if you do back-to-back transfers while using the blocking version of the pop_axi_stream procedure. That we can fix easily with a small change, if necessary. I do see the low cycle when there is no new message in the queue before the reply is sent, so the transfers are not truly back-to-back in that case. For example:

for i in 3 to 14 loop
  if i = 14 then
    last := '1';
  end if;
  pop_axi_stream(net, slave_axi_stream,
                 tdata => data,
                 tlast => last,
                 tkeep => keep,
                 tstrb => strb,
                 tid   => id,
                 tdest => dest,
                 tuser => user);
end loop;

The code above will not give low cycles, only a small glitch on ready, which can be removed by lowering ready only after sending the reply and only if the queue is empty.

image

In the picture above, the first set of transfers is using the check_axi_stream, which is non-blocking. The second set of transfers uses the blocking pop_axi_stream.

The code below does give a low cycle, but as stated is not truly back-to-back:

for i in 3 to 14 loop
  if i = 14 then
    last := '1';
  end if;
  pop_axi_stream(net, slave_axi_stream,
                 tdata => data,
                 tlast => last,
                 tkeep => keep,
                 tstrb => strb,
                 tid   => id,
                 tdest => dest,
                 tuser => user);

  wait for 1 ps;
end loop;

image

Removing the glitches by moving the tready <= '0' in the axistreaming slave: image

olafvandenberg avatar Nov 11 '20 09:11 olafvandenberg

Sorry for mixing in again here, but this looks really interesting. I am pretty sure that we did not have this behaviour here. Which simulator are you using? And which testbench? Can I try to run that one here? What happens if you replace the "wait for 1 ps" in the 50%-duty-cycle example with "wait for 0 ps" (i.e. communication with VUnit)?

eschmidscs avatar Nov 11 '20 13:11 eschmidscs

I'm using ModelSim PE 10.7b. For the test I made a small modification in tb_axi_stream.vhd. The test I modified is "test back-to-back passing check": I simply copied the whole test and replaced the check_axi_stream with a pop_axi_stream:

    elsif run("test back-to-back passing check") then
      wait until rising_edge(aclk);
      timestamp := now;

      last := '0';
      for i in 3 to 14 loop
        if i = 14 then
          last := '1';
        end if;
        push_axi_stream(net, master_axi_stream,
                        tdata => std_logic_vector(to_unsigned(i, 8)),
                        tlast => last,
                        tkeep => "1",
                        tstrb => "1",
                        tid => std_logic_vector(to_unsigned(42, id'length)),
                        tdest => std_logic_vector(to_unsigned(i+1, dest'length)),
                        tuser => std_logic_vector(to_unsigned(i*2, user'length)));
      end loop;

      last := '0';
      for i in 3 to 14 loop
        if i = 14 then
          last := '1';
        end if;
        check_axi_stream(net, slave_axi_stream,
                         expected => std_logic_vector(to_unsigned(i, 8)),
                         tlast => last,
                         tkeep => "1",
                         tstrb => "1",
                         tid => std_logic_vector(to_unsigned(42, id'length)),
                         tdest => std_logic_vector(to_unsigned(i+1, dest'length)),
                         tuser => std_logic_vector(to_unsigned(i*2, user'length)),
                         msg  => "check blocking",
                         blocking  => false);
      end loop;

      check_equal(now, timestamp, result(" setting up transaction stalled"));

      wait_until_idle(net, as_sync(slave_axi_stream));
      check_equal(now, timestamp + (12+1)*10 ns, " transaction time incorrect");

      -- Repeat while using blocking pop_axi_stream to check issue #573
      wait until rising_edge(aclk);
      timestamp := now;

      last := '0';
      for i in 3 to 14 loop
        if i = 14 then
          last := '1';
        end if;
        push_axi_stream(net, master_axi_stream,
                        tdata => std_logic_vector(to_unsigned(i, 8)),
                        tlast => last,
                        tkeep => "1",
                        tstrb => "1",
                        tid => std_logic_vector(to_unsigned(42, id'length)),
                        tdest => std_logic_vector(to_unsigned(i+1, dest'length)),
                        tuser => std_logic_vector(to_unsigned(i*2, user'length)));
      end loop;

      last := '0';
      for i in 3 to 14 loop
        if i = 14 then
          last := '1';
        end if;
        pop_axi_stream(  net, slave_axi_stream,
                         tdata => data,
                         tlast => last,
                         tkeep => keep,
                         tstrb => strb,
                         tid   => id,
                         tdest => dest,
                         tuser => user
                      );
      end loop;

      check_equal(now, timestamp, result(" setting up transaction stalled"));

      wait_until_idle(net, as_sync(slave_axi_stream));
      check_equal(now, timestamp + (12+1)*10 ns, " transaction time incorrect");

Note that the check on time will fail, since the blocking pop consumes time.

olafvandenberg avatar Nov 11 '20 15:11 olafvandenberg

So I had a closer look, and I can't seem to reproduce this behavior.

Can you please try the array_axis_vcs example? The following waveform was generated with GHDL (python3 run.py -g) on Windows (MSYS2/MINGW64).

image

Note: that waveform also shows 5% stalls for both the master and the slave.

The modification in https://github.com/VUnit/vunit/pull/673/files#diff-f2d4d6a3240eda80ddc8a14823d070f532d73aa493d7680a449d846445014477 shows that the UUT can handle back-to-back transfers. However, the current example using a single for loop does introduce dead cycles.

The code above will not give low cycles, only a small glitch on ready which can be removed by lowering ready after sending the reply and only if the queue is empty.

Maybe this is a different behaviour between ModelSim and GHDL? So, where ModelSim is showing a glitch, GHDL shows a cycle?

Or is it because of https://github.com/VUnit/vunit/blob/master/examples/vhdl/array_axis_vcs/src/test/tb_axis_loop.vhd#L122 ?

eine avatar Nov 12 '20 03:11 eine

I tried to run the same test as @olafvandenberg in questasim 2019.2_2. The result is obviously 50%-DC: image

This starts to be scary...

eschmidscs avatar Nov 12 '20 07:11 eschmidscs

... on the other hand: whether the result is 100% or 50% DC depends on the loop exit in axi_stream_slave. It exits if the queue is empty, and then waits for the next clock edge, causing 50% DC. Whether the queue is empty depends on the message sequence, and that sequence ultimately depends on delta-cycle ordering, which AFAIK is not specified by IEEE. So you end up with simulator-dependent behaviour.
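The loop shape being described can be sketched roughly as follows. This is not the actual axi_stream_slave.vhd source; the notify signal is a stand-in for however the component is really woken up:

```vhdl
-- Simplified, hypothetical sketch of the slave's bus process.
process
begin
  loop
    if is_empty(message_queue) then
      tready <= '0';   -- back to idle: this is where the extra low cycle appears
      wait on notify;  -- stand-in: resumed when a new request message arrives
    end if;
    tready <= '1';
    wait until rising_edge(aclk) and tvalid = '1';
    -- ... capture one beat and send the reply ...
    -- Whether the next request is already in message_queue when the loop
    -- restarts depends on delta-cycle ordering, hence 100% vs 50% DC.
  end loop;
end process;
```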

eschmidscs avatar Nov 12 '20 07:11 eschmidscs

I pulled the branch axis/synthetic from eine's fork and ran the array_axis_vcs example tb_axis_loop. I had to comment out some PSL stuff in the FIFO that breaks compilation on my side. The result is shown below: image As you can see, there are the glitches, but no low cycles.

I wonder if it might have something to do with delta cycles.

update: @eschmidscs I didn't see your latest reply, but indeed I think you are correct.

olafvandenberg avatar Nov 12 '20 08:11 olafvandenberg

The delta-cycle explanation is proven, in my opinion. I've added one delta by inserting a wait for 0 ps; right before the tready <= '0' on line 106 in axi_stream_slave.vhd, and I get 50% DC:

image

Expanding to deltas for the 50% dc: delta_50%dc

The same for the glitch: delta_glitch

olafvandenberg avatar Nov 12 '20 08:11 olafvandenberg

FTR, as commented in ghdl/ghdl#752, it seems that this clock alignment issue might have effects beyond the glitch or the additional cycle. The workaround in https://github.com/VUnit/vunit/issues/494#issuecomment-494167501 should provide control over the sizes of the buffers, but it seems not to work. Unfortunately, I did not get to understand what's going on...

umarcor avatar Nov 17 '20 17:11 umarcor

I have also run into this issue. I am seeing different behavior even between simulators from the same vendor (Mentor).

Using several different versions of Modelsim (e.g. Modelsim AE 2020.1), I see the 100% duty cycle behavior (ignoring a single idle cycle near the start, which is not a problem in my case):

image

Using Questa FE 2021.2, I see the 50% duty cycle behavior (which is a problem in my case):

image

This is causing quite a headache. Has anyone found a workaround? I tried inserting large numbers of delta cycles here and there, but I wasn't able to change the behavior.

harry-commin-enclustra avatar Jan 05 '22 17:01 harry-commin-enclustra

The workaround: in a first loop, you pop as many references as you expect transactions, using the corresponding pop_axi_stream procedure, and store them in a queue. This does not consume simulation time. Then you add a second loop that iterates over the same number, each time taking a reference from the queue and calling await_pop_axi_stream_reply with it. This requires a bit more memory, but it decouples the AXIS behaviour from the message passing.
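A sketch of that workaround, using a fixed-size array instead of a queue to hold the references (the two are equivalent here); the beat count, data width, and variable names are illustrative:

```vhdl
-- Two-loop workaround sketch: queue all requests first, then collect replies.
type reference_vector_t is array (natural range <>) of axi_stream_reference_t;
variable refs : reference_vector_t(0 to 11);
variable data : std_logic_vector(7 downto 0);
begin
  -- First loop: one request per expected transaction.
  -- No simulation time passes, so the slave always has work queued.
  for i in refs'range loop
    pop_axi_stream(net, slave_axi_stream, refs(i));
  end loop;
  -- Second loop: block on the replies, in order.
  for i in refs'range loop
    await_pop_axi_stream_reply(net, refs(i), data);
    -- ... check data here ...
  end loop;
```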

eschmidscs avatar Jan 06 '22 07:01 eschmidscs

@eschmidscs that approach will solve the 50% duty cycle behavior by providing 100% instead. But I am more concerned with the fact that different simulators have different behavior (modelsim vs ghdl/questa).

If we clearly state that, for 100% throughput, one should use the non-blocking pop_axi_stream, and that the blocking pop_axi_stream cannot provide 100% throughput, then perhaps we should actually force TREADY low for one clock cycle and not rely on simulator behavior during delta cycles.

This way the behavior of the blocking pop_axi_stream will be the same across different simulators.

tasgomes avatar Jan 06 '22 11:01 tasgomes

@tasgomes There seem to be two different questions. I tried to answer the question about the workaround @hcommin was asking for. The question you are raising is about consistent behaviour, which would not be a workaround anymore, but a real fix. There are some interesting thoughts in this.

If you look at #477, the blocking nature is assumed to be the standard case. From that point of view, I agree with your statement that TREADY should always go low after one transaction.

Personally, I would prefer to have a pop function that does not always cause a stall. But this does not work with the current VC architecture.

eschmidscs avatar Jan 06 '22 12:01 eschmidscs

@eschmidscs Thanks for the reply. I think we are in sync. Ideally the pop function should allow 100% throughput, but if that is not possible, I would force TREADY low for a complete clock cycle in the AXI-Stream component just to make it consistent across simulators.

tasgomes avatar Jan 06 '22 12:01 tasgomes

I would like to find a proper fix for this issue. In my opinion, the difficulty is that, in the current setup, you can't predict whether there will be another transaction "immediately" after sending the reply. So there is no option other than to send the reply, check if there is something in the queue and, if there isn't, block until a new event occurs. Would it be an option to extend the interface with something that indicates that another transaction will follow, so that the bus_process doesn't start blocking on the Clk event just yet?

edit: something like an extra boolean flag in the pop_axi_stream calls, with a default value, that either gives a forced 50% throughput behavior (to remove the simulator-dependent behavior) or provides 100% throughput but then expects more pops immediately after the reply, while maintaining correct clock alignment. It could perhaps also be a separate call, like 'open session' -> pop, pop, pop -> 'close session'.
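Purely to illustrate the proposal, a session-style usage could look like this. This API does not exist in VUnit; the session procedure names are invented, and the blocking pop overload with tdata/tlast is assumed:

```vhdl
-- Hypothetical session API, sketched only to illustrate the proposal above.
open_axi_stream_session(net, slave_axi_stream);   -- announce: more pops will follow
for i in 0 to 11 loop
  pop_axi_stream(net, slave_axi_stream, tdata => data, tlast => last);
end loop;
close_axi_stream_session(net, slave_axi_stream);  -- allow tready to drop again
```

Inside the session, the bus process would know not to start blocking on the Clk event between pops.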

I would like to hear your views on this.

olafvandenberg avatar Jan 11 '22 15:01 olafvandenberg

@olafvandenberg Without doing any proper code study: what happens if you add a "wait for 0 ps;" before pulling TREADY low, but then check again for pop transactions before stalling? That would allow the message passing to run before the stall, so, depending on other waits, the second check would actually see the new pop request. But there might be many side effects...

eschmidscs avatar Jan 11 '22 16:01 eschmidscs

@eschmidscs I think that would give consistent 50% behavior, so always >= 1 low cycle (as mentioned in one of my earlier posts). The checking for pop transactions is actually done in this case by the queue-empty check, if I'm correct. Perhaps a forced delta cycle after the reply might do something. I will try to create a test setup to examine the behavior.

olafvandenberg avatar Jan 11 '22 16:01 olafvandenberg

@olafvandenberg That is what I intended: a forced delta cycle by the wait, but then another check of the queue. Did your test above with the wait already check the queue again after the wait?

eschmidscs avatar Jan 11 '22 16:01 eschmidscs

@eschmidscs I believe so, as the forced delta would be inside the while loop on 'not is_empty(message_queue)'. I think the problem is that the blocking pop is made of a non-blocking pop followed by a wait for the reply, which blocks. The pop message is sent to the VC actor, arrives in the main process, and is forwarded to the bus process by means of the queue. This queue is checked for emptiness, but I assume it is very unlikely that the next pop message is already present on this queue, since (in user code) it is sent after receiving the reply, which in turn is sent by the bus process.

If there would be prior knowledge that a new pop message will come, we can wait for that instead of waiting on a clk'event and a non-empty queue. (Using a notify signal similar to the signal back to the main process)

olafvandenberg avatar Jan 11 '22 16:01 olafvandenberg

The problem with prior knowledge is always that it is not readily available. The workaround of popping references and then waiting for them generates exactly this prior knowledge: you tell the VC how many transactions to expect. I fear that another notify signal will just result in some similar kind of uncertainty.

eschmidscs avatar Jan 11 '22 16:01 eschmidscs

@olafvandenberg Just for the record: my idea with "wait for 0 ps" cannot work, because the same statement is probably used in the message passing itself (I didn't check). So it won't help to push out the decision until "all" message passing has been done...

eschmidscs avatar Jan 11 '22 17:01 eschmidscs

@eschmidscs Yeah, I checked, and I'm able to fix the issue by adding two consecutive 'wait for 0 ps;' statements right after the reply() call. I believe a separate notify will give a more reliable solution. I will continue with this on Thursday.

olafvandenberg avatar Jan 11 '22 17:01 olafvandenberg

I pushed a possible fix to a branch on my fork. In my opinion, it forces a suspend (in delta cycles) until the reply message has been handled, because of the signal assignment (I can't say I'm very experienced with delta cycles though, so please review). @hcommin, could you give this possible fix a try?

olafvandenberg avatar Jan 13 '22 14:01 olafvandenberg

Interesting thought... But is that different from "wait for 0 ps"? I think both will just add some delta cycles, and if the message passing requires more, it will complete later and the stall will still happen. Anyone writing simulators listening? ;)

eschmidscs avatar Jan 14 '22 08:01 eschmidscs