Larger simulations are not deterministic
To Reproduce: MGPUSim at commit 40c4cd4
Command that recreates the problem
./fir -length=65536 -timing
Current behavior: The estimated execution time differs from run to run.
Expected behavior: The estimated execution time should be the same on every run.
I want to know whether this bug has been fixed. I also found that the results of the parallel engine differ from those of the serial engine.
The problem is still there.
If parallel simulation is used, the simulation will certainly be non-deterministic. The effort to make the simulation deterministic only applies to single-kernel serial simulation.
Also, for parallel simulations, how different are they from serial simulations?
In the fir with 4096 * 32 samples to filter, the parallel simulation may be about 3% slower than the serial simulation. In addition, I printed all the events and their scheduled times. In the parallel simulation, the first event of mmu is scheduled at 0.0000000120, but in the serial simulation, it is scheduled at 0.0000000350.
I wonder whether the cause of this problem lies in Go itself or in MGPUSim. If I knew the likely source, I might be able to try fixing this bug myself.
Well, we cannot blame Go for this. There are some Go features that cause non-deterministic execution, and we should avoid them.
There is some good discussion on how to avoid non-deterministic behavior in Go at https://github.com/golang/go/issues/33702. It also points to potential sources of non-deterministic behavior.
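One of the best-known sources of non-determinism in Go is map iteration: the language deliberately randomizes the order in which `range` visits map keys. A minimal sketch of the usual workaround (collect the keys and sort them before iterating; the port names here are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns a map's keys in a stable, sorted order.
// Ranging over a Go map directly yields a randomized order on
// every run, which is a common cause of non-deterministic behavior.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ports := map[string]int{"RDMA": 2, "CommandProcessor": 1}
	fmt.Println(sortedKeys(ports)) // prints [CommandProcessor RDMA]
}
```

Any place a simulator derives event or component order from a raw map range is a candidate for this kind of fix.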
One thing I am thinking about is to try to create super simple simulations. The root of the problem may be on the Akita side.
The difference between parallel and serial simulations is a separate problem. I have created #45 for it. For now, can you mainly use the serial simulation?
I can use the serial simulation currently. Thanks for your reply.
Prof. Sun, I think I may have fixed the bug causing larger simulations to be non-deterministic.
First, I recorded the scheduling and handling order of events and found that the access order of an endpoint's device ports is not deterministic (akita/noc/networking/switching/endpoint.go: sendFlitOut(now)), which causes the Tick() functions of the different components connected to the endpoint to execute in a random order.
As the figures show, the left figure executes the RDMA first, while the right figure executes the CommandProcessor first. This happens because when the timing platform plugs the device into the endpoint (PlugInDevice(pcieSwitchID, gpu.Domain.Ports())), the order of the ports returned by gpu.Domain.Ports() is not deterministic.
So I modified the code of Ports(), and the bug appears to be fixed.
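A minimal sketch of the idea behind such a fix, assuming ports are stored in a name-keyed map (the `Port` and `Device` types here are simplified stand-ins, not the actual Akita API): sort the ports before returning them so every caller, including PlugInDevice, sees the same order on every run.

```go
package main

import (
	"fmt"
	"sort"
)

// Port is a simplified stand-in for Akita's port type; only a
// name is needed to demonstrate deterministic ordering.
type Port struct{ Name string }

// Device stores its ports in a map, mirroring the hypothesized
// cause: ranging over the map returns ports in a random order.
type Device struct{ ports map[string]*Port }

// Ports returns the device's ports sorted by name, so the
// returned slice is identical across runs.
func (d *Device) Ports() []*Port {
	out := make([]*Port, 0, len(d.ports))
	for _, p := range d.ports {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	d := &Device{ports: map[string]*Port{
		"RDMA":             {Name: "RDMA"},
		"CommandProcessor": {Name: "CommandProcessor"},
	}}
	for _, p := range d.Ports() {
		fmt.Println(p.Name) // CommandProcessor, then RDMA, every run
	}
}
```

The actual patch in the PR may differ in detail, but any stable tiebreak (name, ID, insertion index) restores determinism here.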
@MaxKev1n Looks great! Can you start a pull request so I can look deeper into it?
BTW, there is a determinism test script under test/deterministic. In the Python file, you can see a line that is commented out. You can re-enable that line and see whether the problem is solved. If not, we can at least find out up to what problem size determinism holds.
With your determinism test script, I found that running fir on a single GPU does not reproduce the problem, so I ran fir on 4 GPUs and reproduced it successfully. My code eliminates most of the non-determinism except for a set of metrics called CPIStack; I think CPIStack may have a separate problem. I also found that there may be a small difference in the total time of fir, but I think this is acceptable.
@MaxKev1n Thanks for the PR. I am merging it.
However, given the small remaining difference, I do not think this problem is fully resolved. Being fully deterministic matters mostly for debugging: when we find a bug, we want to rerun the program and have the bug occur at the exact same location. We will keep looking into the problem. I think we are close.