Larger simulations are not deterministic
To Reproduce: MGPUSim at commit 40c4cd4
Command that recreates the problem
./fir -length=65536 -timing
Current behavior: The estimated execution time differs from run to run.
Expected behavior: The estimated execution time should be the same on every run.
I want to know whether this bug has been fixed. I also found that the results of the parallel engine differ from those of the serial engine.
The problem is still there.
If parallel simulation is used, the simulation will certainly be non-deterministic. The effort to make the simulation deterministic only applies to single-kernel serial simulation.
Also, for parallel simulations, how different are they from serial simulations?
In the fir with 4096 * 32 samples to filter, the parallel simulation may be about 3% slower than the serial simulation. In addition, I printed all the events and their scheduled times. In the parallel simulation, the first event of mmu is scheduled at 0.0000000120, but in the serial simulation, it is scheduled at 0.0000000350.
I wonder whether the cause of this problem lies in Go itself or in MGPUSim. If I knew the likely source, I might be able to try fixing this bug myself.
Well, we cannot blame Go for this. There are some Go features that cause non-deterministic execution, and we should avoid them.
There is some good discussion on how to avoid non-deterministic behavior in Go at https://github.com/golang/go/issues/33702. It also points to potential sources of non-deterministic behavior.
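One of the best-known sources of non-determinism in Go is map iteration: the language deliberately randomizes the order in which `range` visits map keys. A minimal sketch of the usual workaround (collect the keys and sort them before iterating; the port names here are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns a map's keys in a stable, sorted order.
// Ranging over a Go map directly yields a randomized order on
// every run, which is a common cause of non-deterministic behavior.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ports := map[string]int{"RDMA": 2, "CommandProcessor": 1}
	fmt.Println(sortedKeys(ports)) // prints [CommandProcessor RDMA]
}
```

Any place a simulator derives event or component order from a raw map range is a candidate for this kind of fix.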
One thing I am thinking about is to try to create super simple simulations. The root of the problem may be on the Akita side.
The difference between parallel and serial simulations is a separate problem. I have created #45 for it. For now, can you mainly use the serial simulation?
I can use the serial simulation currently. Thanks for your reply.
Prof. Sun, I think I may have fixed the bug causing larger simulations to be non-deterministic.
First, I recorded the scheduling and handling order of events and found that the access order of an endpoint's device ports is not deterministic (akita/noc/networking/switching/endpoint.go: sendFlitOut(now)), which causes the Tick() functions of the different components connected to the endpoint to execute in a random order.
As the figures show, the left figure executes the RDMA first, while the right figure executes the CommandProcessor first. This happens because when the timing platform plugs the device into the endpoint (PlugInDevice(pcieSwitchID, gpu.Domain.Ports())), the order of the ports returned by gpu.Domain.Ports() is not deterministic.
So I modified the code of Ports(), and the bug appears to be fixed.
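A minimal sketch of the idea behind such a fix, assuming ports are stored in a name-keyed map (the `Port` and `Device` types here are simplified stand-ins, not the actual Akita API): sort the ports before returning them so every caller, including PlugInDevice, sees the same order on every run.

```go
package main

import (
	"fmt"
	"sort"
)

// Port is a simplified stand-in for Akita's port type; only a
// name is needed to demonstrate deterministic ordering.
type Port struct{ Name string }

// Device stores its ports in a map, mirroring the hypothesized
// cause: ranging over the map returns ports in a random order.
type Device struct{ ports map[string]*Port }

// Ports returns the device's ports sorted by name, so the
// returned slice is identical across runs.
func (d *Device) Ports() []*Port {
	out := make([]*Port, 0, len(d.ports))
	for _, p := range d.ports {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	d := &Device{ports: map[string]*Port{
		"RDMA":             {Name: "RDMA"},
		"CommandProcessor": {Name: "CommandProcessor"},
	}}
	for _, p := range d.Ports() {
		fmt.Println(p.Name) // CommandProcessor, then RDMA, every run
	}
}
```

The actual patch in the PR may differ in detail, but any stable tiebreak (name, ID, insertion index) restores determinism here.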
@MaxKev1n Looks great! Can you start a pull request so I can look deeper into it?
BTW, there is a determinism test script under test/deterministic. In the Python file, you can see a line that is commented out. You can re-enable that line and see whether the problem is solved. If not, we can at least find out up to what problem size determinism holds.
With your determinism test script, I found that running fir on a single GPU does not reproduce the problem, so I ran fir on 4 GPUs and reproduced it successfully. My code eliminates most of the non-determinism except for a set of metrics called CPIStack; I think CPIStack may have a separate problem. I also found that there may be a small difference in the total time of fir, but I think this is acceptable.
@MaxKev1n Thanks for the PR. I am merging it.
However, given the small remaining difference, I do not think this problem is fully resolved. Being fully deterministic matters mostly for debugging: when we find a bug, we want to rerun the program and have the bug occur at the exact same location. We will keep looking into the problem. I think we are close.