optimization exhibits non-deterministic behavior

Open NeuralCoder3 opened this issue 2 years ago • 4 comments

Sometimes, the behavior of the optimization pipeline seems to be non-deterministic.

Example: ./build/bin/thorin -d mem -o - lit/mem/no_mem.thorin -VVVV in https://github.com/NeuralCoder3/thorin2/tree/ad_ptr_merge 702d848

The issue might be due to the add_mem optimization, the pipeline builder, or an underlying bug in thorin.

This behavior might also be a side effect of the previous (not merged yet) changes to mem and clos conv with long-reaching impact that did not manifest up to now.

NeuralCoder3, Nov 11 '22 13:11

Yes, this is super annoying. Another source is this:

world.app(emit1(), emit2());

It's unspecified whether emit1() or emit2() is evaluated first, so this code can behave differently on different compilers and operating systems.
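A minimal sketch of the evaluation-order problem and one way to avoid it. C++ leaves the order in which function arguments are evaluated unspecified, so binding each result to a named local first pins the order. The functions below (`emit1`, `emit2`, `app`, `call_order`) are hypothetical stand-ins that only record the call order; they are not Thorin's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins: each "emit" records that it ran, mimicking
// functions whose side effects (e.g. creating new nodes) depend on order.
std::vector<std::string> call_order;

int emit1() { call_order.push_back("emit1"); return 1; }
int emit2() { call_order.push_back("emit2"); return 2; }
int app(int callee, int arg) { return callee * 10 + arg; }

// Order unspecified: the compiler may evaluate emit2() before emit1().
int app_unsequenced() { return app(emit1(), emit2()); }

// Order fixed: each initialization is sequenced before the next statement,
// so emit1() always runs before emit2().
int app_sequenced() {
    int callee = emit1();
    int arg    = emit2();
    return app(callee, arg);
}

// Small self-check of the sequenced variant.
void self_check() {
    call_order.clear();
    assert(app_sequenced() == 12);
    assert(call_order[0] == "emit1" && call_order[1] == "emit2");
}
```

The returned value is the same in both variants; only the order of the side effects can differ, which is exactly the kind of difference that would show up as diverging output between compilers.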

I have implemented the --trace-gids switch that we could somehow use to test for this in our CI.

leissa, Nov 14 '22 13:11

The issue happens only sometimes, with the same executable on the same computer under the same circumstances. Therefore, timing issues or randomness might be the cause.

Probably related issue: ./build/bin/thorin -d matrix -d affine lit/matrix/mapReduce_mult.thorin -o - -VVVV in matrix_dialect f3a3def sometimes generates thorin code and sometimes prints the following error:

:4294967295: error: cannot pass argument 
  '(__806508#2:(.Idx 3), ‹__806508#2:(.Idx 3); .Idx 4294967296›, 0)' of type 
  '[.Nat, «__806508#2:(.Idx 3); ★», .Nat]' to 
  '%mem.lea' of domain 
  '[n_834521: .Nat, _834535: «n_836768; ★», _834540: .Nat]'

which seems odd to me as the arguments are of the style

(n, <n; T>, 0)

which should be the type

[n: .Nat, <<n; *>>, .Nat]

which should agree with lea.

NeuralCoder3, Nov 17 '22 13:11

I was fighting this issue in #184, as a Debug build produced different output than the Release one:

  • 05e833b23e3318b1441f3548cbf8d636b6f0502b A few asserts created new Defs resulting in slightly different behavior between Debug and Release builds. This commit fixes the issue.
  • 2997a1d9f7a611b5aa48cea33a7b008ded00e99d This one fixes a subtle problem when a Def has coincidentally the same name as an external Def.

As mentioned above, --trace-gids and --reeval-breakpoints helped me track down the problem. We could probably write a test case with some non-trivial code, run it with --trace-gids, and double-check in our CI that all builds produce the same output.
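The comparison step of such a CI check could look roughly like the sketch below. It assumes the output of each build has already been captured as text and makes no assumption about the format of that output. The helper `first_divergence` is hypothetical and not part of Thorin; it reports the first line at which two outputs differ.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Returns the 1-based number of the first line where the two outputs differ,
// or std::nullopt if they are line-by-line identical.
// Note: a single trailing newline is not treated as a difference.
std::optional<std::size_t> first_divergence(const std::string& a, const std::string& b) {
    std::istringstream sa(a), sb(b);
    std::string la, lb;
    for (std::size_t n = 1;; ++n) {
        bool got_a = static_cast<bool>(std::getline(sa, la));
        bool got_b = static_cast<bool>(std::getline(sb, lb));
        if (!got_a && !got_b) return std::nullopt; // both exhausted: identical
        if (got_a != got_b || la != lb) return n;  // one ended early or lines differ
    }
}

// Small self-check with made-up inputs.
void self_check() {
    assert(!first_divergence("a\nb\n", "a\nb\n").has_value());
    assert(first_divergence("a\nb", "a\nc").value() == 2);
}
```

Reporting the first differing line rather than a plain pass/fail makes it easier to see where two builds start to diverge.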

leissa, Mar 08 '23 00:03

While #185 fixes part of this problem, there are still some odd things happening and we need a test case to test for this.

leissa, Mar 27 '23 14:03