
Random number range expansion for Perf tests

Open lakhinderwalia opened this issue 10 months ago • 25 comments

lakhinderwalia avatar Apr 04 '24 20:04 lakhinderwalia

We probably want an enum to select different generators, because I don't think these types of generators are good for verification.
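
A minimal sketch of what such an enum-selected generator could look like (a standalone illustration; the enum, function, and mode names are hypothetical, not the actual generate.hpp API):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: two data-generation modes selected by an enum.
enum class data_mode
{
    verify, // small, fixed set of values -> tight tolerances stay meaningful
    perf    // full-range values -> stresses the kernels for perf/tuning runs
};

inline std::vector<float> generate_floats(std::size_t n, data_mode mode, std::uint64_t seed = 0)
{
    std::mt19937_64 gen(seed);
    std::vector<float> result(n);
    if(mode == data_mode::verify)
    {
        // Old-style behavior: draw from 32 exactly representable values in
        // [-1, 1), so simple operations lose little or no precision.
        std::uniform_int_distribution<int> dist(-16, 15);
        for(auto& x : result)
            x = dist(gen) / 16.0f;
    }
    else
    {
        // Perf mode: use the full continuous range [-1, 1], exercising all
        // mantissa bits and a much larger value space.
        std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
        for(auto& x : result)
            x = dist(gen);
    }
    return result;
}
```

The point of the split is that verification keeps a small, exactly representable value set while perf runs get the full range.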

pfultz2 avatar Apr 04 '24 21:04 pfultz2

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.02%. Comparing base (9f5af93) to head (97b181b). Report is 154 commits behind head on develop.

Files with missing lines Patch % Lines
src/include/migraphx/generate.hpp 91.66% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2953      +/-   ##
===========================================
+ Coverage    91.99%   92.02%   +0.02%     
===========================================
  Files          489      490       +1     
  Lines        19423    19437      +14     
===========================================
+ Hits         17869    17886      +17     
+ Misses        1554     1551       -3     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Apr 04 '24 21:04 codecov[bot]

Test    Batch    Rate new (97b181)    Rate old (2c4dd4)    Diff    Compare
torchvision-resnet50 64 1,743.59 1,744.86 -0.07% :white_check_mark:
torchvision-resnet50_fp16 64 4,045.74 4,050.81 -0.13% :white_check_mark:
torchvision-densenet121 32 1,462.80 1,462.21 0.04% :white_check_mark:
torchvision-densenet121_fp16 32 2,522.86 2,527.03 -0.16% :white_check_mark:
torchvision-inceptionv3 32 877.16 877.83 -0.08% :white_check_mark:
torchvision-inceptionv3_fp16 32 1,485.59 1,485.81 -0.01% :white_check_mark:
cadene-inceptionv4 16 407.14 407.35 -0.05% :white_check_mark:
cadene-resnext64x4 16 419.29 419.43 -0.03% :white_check_mark:
slim-mobilenet 64 4,093.17 4,093.22 -0.00% :white_check_mark:
slim-nasnetalarge 64 101.13 101.12 0.01% :white_check_mark:
slim-resnet50v2 64 1,685.24 1,686.03 -0.05% :white_check_mark:
bert-mrpc-onnx 8 616.67 615.30 0.22% :white_check_mark:
bert-mrpc-tf 1 279.56 276.34 1.17% :white_check_mark:
pytorch-examples-wlang-gru 1 319.03 322.21 -0.99% :white_check_mark:
pytorch-examples-wlang-lstm 1 290.98 294.16 -1.08% :white_check_mark:
torchvision-resnet50_1 1 474.37 476.51 -0.45% :white_check_mark:
cadene-dpn92_1 1 249.78 248.85 0.37% :white_check_mark:
cadene-resnext101_1 1 200.63 205.81 -2.52% :white_check_mark:
onnx-taau-downsample 1 204.66 204.68 -0.01% :white_check_mark:
dlrm-criteoterabyte 1 22.90 22.89 0.05% :white_check_mark:
dlrm-criteoterabyte_fp16 1 42.72 42.67 0.10% :white_check_mark:
agentmodel 1 6,282.28 6,007.95 4.57% :high_brightness:
unet_fp16 2 34.21 34.25 -0.10% :white_check_mark:
resnet50v1_fp16 1 581.93 599.72 -2.97% :white_check_mark:
resnet50v1_int8 1 585.18 586.21 -0.18% :white_check_mark:
bert_base_cased_fp16 64 646.22 646.09 0.02% :white_check_mark:
bert_large_uncased_fp16 32 199.02 198.99 0.02% :white_check_mark:
bert_large_fp16 1 117.46 117.68 -0.19% :white_check_mark:
distilgpt2_fp16 16 1,212.55 1,210.37 0.18% :white_check_mark:
yolov5s 1 301.49 301.54 -0.02% :white_check_mark:
tinyllama 1 23.33 23.33 0.02% :white_check_mark:
vicuna-fastchat 1 131.24 133.06 -1.37% :white_check_mark:
whisper-tiny-encoder 1 244.30 244.46 -0.06% :white_check_mark:
whisper-tiny-decoder 1 256.25 256.52 -0.11% :white_check_mark:

Check results before merge :high_brightness:

migraphx-bot avatar Apr 10 '24 15:04 migraphx-bot


     :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
     :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
     :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
     :white_check_mark: unet: PASSED: MIGraphX meets tolerance
     :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
     :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
     :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
     :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
     :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
     :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Apr 10 '24 15:04 migraphx-bot

Are there tests for these methods? It seems odd that you can change the range limits and not need to change some test results.

We stay in the same nominal range as before. Floats are between -1.0 and 1.0. Integers stay within their respective limits: int8 will stay within an 8-bit representation, and so on.

Earlier, the (nominal) range was not sufficiently filled or spread out. There were only 32 unique floating-point values produced out of a very large set of possibilities. That made some of the tests extremely limited in their scope and value.

Thanks.

lakhinderwalia avatar Apr 11 '24 00:04 lakhinderwalia

I'm OK with either 1 or 2. Option 3 would take refactoring a lot of tests so unlikely we'll be doing that.

I think Option 2 is good. Thanks.

lakhinderwalia avatar Apr 11 '24 17:04 lakhinderwalia

Keep in mind that these are the verify unit tests, not the driver verify

The verify unit tests and the driver verify work in a similar manner. The verify unit tests are just smaller test cases instead of end-to-end models.

or accuracy checker verify that uses ORT.

The accuracy checker uses numpy.random.rand to generate the data, which produces values between 0 and 1 (so no negative values). It also may not stress the hardware as much.

Have different modes for the random number generation (RNG) as mentioned by Paul with the enums. Have the verify tests continue to use the old RNG and use the new RNG by default for performance testing.

We can just add a flag to the driver command to choose which one we want to use. We can still default to the former (this usually is not a problem for measuring perf differences). And we can have another flag to stress the hardware.

more work, slightly less clumsy: Use the new RNG for everything and update the tolerances for the verify tests that need it.

We shouldn't increase the tolerance as we can miss actual errors as well.

The problem with generating numbers that use all the bits of the mantissa is that when we do any simple operation (such as x + y or x * y) there will be significant accuracy loss, whereas before there would be very little or even no accuracy loss, since there were free bits in the mantissa to compute an accurate result. There can still be additional accuracy loss from doing the calculations in a different order (usually due to using a parallel algorithm on the GPU), because float operations are not actually associative.

So we need room for some error from operations done in a different order, but we don't want to add more error to our calculation, which would require us to increase the tolerance, as that could prevent us from finding bugs.
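
To make the two effects concrete, here is a small standalone illustration (not project code) of rounding loss on full-mantissa operands and of non-associative float addition; the specific values are chosen only for demonstration:

```cpp
#include <cstdio>

int main()
{
    // 1) Full-mantissa operands: the exact product of two 24-bit mantissas
    //    needs up to 48 bits, so the float multiplication must round.
    float a = 0.333333343f; // ~1/3, uses all mantissa bits
    float b = 0.142857149f; // ~1/7, uses all mantissa bits
    double exact   = static_cast<double>(a) * static_cast<double>(b);
    float  rounded = a * b;
    std::printf("rounding error: %g\n", exact - static_cast<double>(rounded));

    // 2) Reordering: float addition is not associative, so a parallel
    //    reduction that changes the order can give a different sum.
    float x = 1e-8f, y = 1.0f, z = -1.0f;
    std::printf("(x+y)+z = %g, x+(y+z) = %g\n", (x + y) + z, x + (y + z));
    return 0;
}
```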

a lot more work, more effective for what we want from the verify tests

This is not only more work to implement but it also makes it more cumbersome to add tests. So this is not a good option.

pfultz2 avatar Apr 11 '24 21:04 pfultz2

Use the new RNG for everything and update the tolerances for the verify tests that need it.

We shouldn't increase the tolerance as we can miss actual errors as well.

The problem with generating numbers that use all the bits of the mantissa is that when we do any simple operation (such as x + y or x * y) there will be significant accuracy loss, whereas before there would be very little or even no accuracy loss, since there were free bits in the mantissa to compute an accurate result. There can still be additional accuracy loss from doing the calculations in a different order (usually due to using a parallel algorithm on the GPU), because float operations are not actually associative.

A simple binary FP operation like a * or a + has no order issues from the accuracy perspective. Since we create FP numbers within the range -1.0 to 1.0, there is no case of overflow involved, even if multiple add/multiply/subtract operations are involved.

So we need room for some error from operations done in a different order, but we don't want to add more error to our calculation, which would require us to increase the tolerance, as that could prevent us from finding bugs.

a lot more work, more effective for what we want from the verify tests

This is not only more work to implement but it also makes it more cumbersome to add tests. So this is not a good option.

With just a few changes in the test/verify subdir, this is less work to implement, and almost done.

And it is important to understand here why certain tests are failing:

  1. 4 failing tests are of mixed precision, and the tolerance is deduced to be that of the higher-precision value. These mixed tests should obviously use the tolerance of the lower-precision type. The idea that a mixed-precision case is operating at a tighter tolerance is a false impression -- it just has not been exercised enough!

  2. One or two tests are created for double precision, whereas the implementation is in FP32. The tolerance should be set to the FP32 level in this case.

  3. A logically incorrect test is quantize_linear_convert, where the convert stage is actually optimized out for the GPU, and that leads to an apples-to-oranges comparison. The assumption that this test is currently passing ignores that reality.

So far no verify test is failing due to an increased integer range. All the failing tests are FP related. And we can reduce the integer range if really required.

The idea that all the FP verify tests are to be done with just 32 possible values is technically very unsound. We should be happy to adjust the tolerances of these very few cases, without worrying about some future test cases. :-)

lakhinderwalia avatar Apr 15 '24 22:04 lakhinderwalia

The ranges of random number generation have been suitably expanded to provide maximum coverage. All the unit tests for Debug & Non-debug modes pass. A few tolerances had to be adjusted, logically so, because of mixed-precision calculations -- they can't be expected to match the tighter tolerance of the higher-precision component. In fact, the failure of a test expecting the tighter tolerance is a good indicator that the test instrumentation is working as expected. In the future, for the Debug-mode CPU-to-Ref comparison, if the int32 range were not represented by FP32 (also a DNN suite limitation in the current implementation), the range of int32 numbers generated could be expanded further.

lakhinderwalia avatar Apr 22 '24 18:04 lakhinderwalia

The background I overheard for this PR is that the original random number generation did not adequately stress the kernels, since it produced only a small subset of possible values. With the changes in this PR we're hitting another problem: several verify tests fail because of the changed number generation. Keep in mind that these are the verify unit tests, not the driver verify or the accuracy-checker verify that uses ORT. We should probably rename these to make the distinction clearer. Here are my ideas to resolve both issues:

  1. less work, more clumsy: Have different modes for the random number generation (RNG) as mentioned by Paul with the enums. Have the verify tests continue to use the old RNG and use the new RNG by default for performance testing.
  2. more work, slightly less clumsy: Use the new RNG for everything and update the tolerances for the verify tests that need it.
  3. a lot more work, more effective for what we want from the verify tests: Change the verify tests to use golden data, such that we're testing both the ref and gpu targets against golden data. We currently test ref against golden data in test_ref and then test other targets against ref. It would make more sense to test directly against golden data. The way these verify unit tests currently work, all we screen for is whether the other target's results differ from ref. The logical next step for debugging is to figure out whether the source of the problem is ref or the target, which means testing both against golden data.

I'm OK with either 1 or 2. Option 3 would take refactoring a lot of tests so unlikely we'll be doing that.

@CharlieL7, the enum option 1 wasn't solving the core issue: namely, the limited coverage that such numbers were generating. That would not have caught the issues that were lurking and have now been exposed!

It made full sense to follow option 2. All the tests also pass now; a few had to be fixed. And we can feel good about having a much-expanded test vector space, something option 1 would not have given us. Thanks.

lakhinderwalia avatar Apr 22 '24 18:04 lakhinderwalia

The ranges of random number generation have been suitably expanded to provide maximum coverage. All the unit tests for Debug & Non-debug modes pass. A few tolerances had to be adjusted, logically so as well, because of mixed precision calculations -- they can't be expected to match the tighter tolerance of the higher precision component.

Are we introducing a clear rule that for tests involving mixed precision types, the correct tolerance is the tolerance for whichever is the lower-precision input? That seems logical at least, but do we have it in writing?

bpickrel avatar Apr 22 '24 18:04 bpickrel

The ranges of random number generation have been suitably expanded to provide maximum coverage. All the unit tests for Debug & Non-debug modes pass. A few tolerances had to be adjusted, logically so as well, because of mixed precision calculations -- they can't be expected to match the tighter tolerance of the higher precision component.

Are we introducing a clear rule that for tests involving mixed precision types, the correct tolerance is the tolerance for whichever is the lower-precision input? That seems logical at least, but do we have it in writing?

The original golden number (= 80) for tolerance was chosen in some heuristic way, so there never was a clear rule or formula to start with. Numbers like 8000 are used to relax that tolerance expectation by an order of magnitude or two, and thus are just multipliers of the original number. Thanks.

lakhinderwalia avatar Apr 22 '24 20:04 lakhinderwalia

A simple binary FP operation like a * or a + has no order issues from the accuracy perspective.

It does because we can reorder these operators to do things more efficiently.

Since we create FP numbers within the range -1.0 to 1.0, there is no case of overflow involved even if multiple adds/multiply/subtract operations were involved.

I don't mean overflow in that sense, but since we are using all 24 bits of the mantissa, when we do addition or multiplication we might need a much larger mantissa to represent the value exactly. We still get a value, but many bits will get chopped off.

This causes extra accuracy loss beyond just the loss from reordering, so then we need to increase the tolerance (which might not catch issues). I see you are increasing the

With just a very few changes in test/verify subdir, this is less work to implement, and almost done.

These tests are being solved by just increasing the tolerance, which we don't want to do.

This didn't find bugs in those cases. The only bug that was found was in cpu/layernorm.cpp, and you didn't fix the bug, you just changed the test case so it would pass.

There are ways we can improve test_verify to find these bugs without needing to change the RNG and increase tolerances. Instead, we can:

  • Use highest precision for ref version
  • Use allclose to compare per element
  • Better tolerance selection based on what is done in the graph

All these can achieve better verification without needing to increase the tolerance.
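
As a rough sketch of the allclose idea above (the function name and tolerance defaults here are illustrative, not the existing test_verify API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative allclose: every element must satisfy
// |a - b| <= atol + rtol * |b|, mirroring numpy.allclose semantics,
// instead of only comparing an aggregate RMS error.
bool allclose(const std::vector<float>& a,
              const std::vector<float>& b,
              float rtol = 1e-3f,
              float atol = 1e-5f)
{
    if(a.size() != b.size())
        return false;
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        if(std::fabs(a[i] - b[i]) > atol + rtol * std::fabs(b[i]))
            return false;
    }
    return true;
}
```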

So far no verify test is failing due to an increased integer range.

We don't have a lot of integer tests, but this could fail more when switching ref to use higher precision. I think we should remove those changes (or have an enum flag to switch it).

The idea that all the FP verify tests are to be done with just 32 possible values, is very unsound, technically.

A hardcoded 32 for all FPs is probably not the best choice. I would think something depending on the data type, like 1ULL << sizeof(T) * 2, would be better.
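
For illustration, a type-dependent count along those lines might look like this (a sketch, assuming typical 4-byte float and 8-byte double, not existing code):

```cpp
#include <cstdint>

// Sketch of a type-dependent value count: larger types get a larger
// (but still limited) set of distinct test values.
template <class T>
constexpr std::uint64_t distinct_value_count()
{
    return 1ULL << (sizeof(T) * 2);
}

static_assert(distinct_value_count<float>() == 256, "4-byte type -> 256 values");
static_assert(distinct_value_count<double>() == 65536, "8-byte type -> 65536 values");
```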

Either way, we don't want to use all values of the mantissa for verification; we only want to use them for perf runs, which is why we want to use an enum. None of the other computation libraries (such as rocBLAS or CK) use the same RNG for verification and perf either.

pfultz2 avatar Apr 25 '24 16:04 pfultz2

The original golden number (= 80) for tolerance was chosen in some heuristic way.

The 80 was chosen because 80 * std::numeric_limits<float>::epsilon() is roughly 1e-6.

pfultz2 avatar Apr 25 '24 16:04 pfultz2

The original golden number (= 80) for tolerance was chosen in some heuristic way.

The 80 was chosen because 80 * std::numeric_limits<float>::epsilon() is roughly 1e-6.

This suggested tolerance value for FP32 cannot logically apply to a mixed FP32 + FP16 precision operation, as the overall precision is effectively lowered. We weren't catching it earlier due to a severely compromised testing range. So adjusting these tolerances is the logical thing to do, rather than compromising the entire testing suite just to avoid changing tolerances in a very few cases!

lakhinderwalia avatar Apr 25 '24 20:04 lakhinderwalia

This didnt find bugs in those cases. The only bug that was found was in cpu/layernorm.cpp, and you didnt fix the bug, you just change the test case so ti would pass.

First of all, cpu/layernorm was using a wrong default value, which I made conform to the ONNX standard: 1e-5. Secondly, the original bug in cpu/layernorm.cpp, where it doesn't read the eps value from the operator, remains; it should be fixed in a different PR, and I haven't claimed that I fixed that original issue. This test passes now because the defaults match, not due to something magical, but logical!

This isn't the only bug that was found. Here are two more broad cases:

  1. Earlier we were getting false positives on mixed-precision computations, all due to severely compromised test vectors! Now that is fixed. No more false positives. This relates to 4 FP32+FP16 tests. We have a similar issue with 2 FP64 tests -- which are internally running as FP32 tests!

  2. test_quantizelinear_convert (ref version) was in reality being compared against the GPU version of (effectively) test_quantizelinear, because the converts are optimized out in the GPU case. This was never caught! I removed the test for just that reason.

lakhinderwalia avatar Apr 25 '24 20:04 lakhinderwalia

I don't mean overflow in that sense, but since we are using all 24 bits of the mantissa, when we do addition or multiplication we might need a much larger mantissa to represent the value exactly. We still get a value, but many bits will get chopped off.

There is no easy control over which part of the mantissa should or could be used here! This cannot be a sound testing criterion.

For instance, look at a few numbers; some use all the mantissa bits they can, some don't: 1/3 = (3eaaaaab), 1/4 = (3e800000), 1/5 = (3e4ccccd), 1/8 = (3e000000). Guessing the internal representation of numbers via mantissa size is an incorrect approach, as is assuming that some fractional-number additions will generate errors! These are FP values, and we should just work with properly chosen tolerances. Right now there is no need for such a mantissa consideration; there are 500+ tests passing as before!
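
Those bit patterns are easy to confirm with a few standalone lines (an illustration, not project code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Return the IEEE-754 single-precision bit pattern of a float.
std::uint32_t bits_of(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u)); // type-pun safely via memcpy
    return u;
}

int main()
{
    std::printf("1/3 = %08x\n", (unsigned)bits_of(1.0f / 3.0f)); // 3eaaaaab
    std::printf("1/4 = %08x\n", (unsigned)bits_of(1.0f / 4.0f)); // 3e800000
    std::printf("1/5 = %08x\n", (unsigned)bits_of(1.0f / 5.0f)); // 3e4ccccd
    std::printf("1/8 = %08x\n", (unsigned)bits_of(1.0f / 8.0f)); // 3e000000
    return 0;
}
```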

lakhinderwalia avatar Apr 25 '24 21:04 lakhinderwalia

This tolerance suggested value for FP32 can not logically apply to a mixed FP32 precision + FP16 precision operation

Yeah, we could pick the FP tolerance based on the lowest precision instead of the final precision. That's what we do in the driver already, but this is really beyond the scope of this PR.

First of all, cpu/layernorm was using a wrong default value, that I made it conforming to the onnx standard: 1e-5

Models don't always use the same epsilon; that's why we have it as a variable for layernorm. And the value that was chosen was what was being used by models at the time we added it, so it's not a wrong value. The cpu backend needs to be updated to use a different epsilon when a different one is used in the model.

Since models don't always use the same epsilon (some models do not use the ONNX layernorm operator), it's good to have a test using a different epsilon to make sure we are calculating it correctly.

For instance a look at a few numbers, some use all the bits of mantissa they can, some don't

Sure, not all of them will be low, especially after divisions or sqrts, but choosing the numbers like this should decrease the probability of it happening.

These are FP values, and we should just work with properly chosen tolerances.

But this moves the tolerances in the wrong direction. We want to choose the test data so we can use much stricter tolerances.

pfultz2 avatar Apr 25 '24 22:04 pfultz2

Either way, I think the test verification updates are out of scope for this. The goal is to improve the RNG for perf runs specifically, so I think this PR should focus on updating the RNG for those cases only, and we can address verification issues in a later PR.

pfultz2 avatar Apr 25 '24 22:04 pfultz2

But this moves the tolerances in the wrong direction. We want to choose the test data so we can use much stricter tolerances.

Strict tolerances coupled with poorly chosen test vectors! That means false positives in tests and potentially poor performance estimates for fine-tuning!

The idea that a test of mixed precision will yield a higher-precision result is a deeply flawed assumption, somehow made to look good by choosing compromised data.

The scope of this PR is to fix what is broken. And I have fixed all the important issues here: verify testing is better, perf results are more accurate, and, as a result of better tuning, run-time performance will be better! Thank you.

lakhinderwalia avatar Apr 26 '24 02:04 lakhinderwalia

I'm OK with either 1 or 2. Option 3 would take refactoring a lot of tests so unlikely we'll be doing that.

@CharlieL7, I have made all the changes (and then some): of the first two options that you were leaning towards, I have implemented option #2. Thanks.

lakhinderwalia avatar Apr 26 '24 03:04 lakhinderwalia

Strict tolerances coupled with poorly chosen test vectors!

If we have stricter tolerances, then we have chosen better test data in general. CK and rocBLAS use exact matching (so no tolerances) with carefully chosen input data for their verification, but they don't use this test data for perf measurement or tuning. We should be doing the same here: have an enum to select the data we want for verification and the data we want for perf measurement. We can debate what we want the verification data to be in a later issue or PR.

The idea that a test of mixed precision will yield a higher precision result is a deeply flawed assumption

I see no problem with this assumption, especially if we update the ref to always use the highest precision.

somehow made to look good by choosing compromised data.

I see no problem with the test data, especially since it allows us to use stricter tolerances.

The scope of this PR is to fix what is broken.

The scope of the issue was to improve the RNG for perf measurements; it wasn't to change our verify tests. There are many ways to improve our verification tests, and some of them are debatable as to whether they are better. Rather than debate this, let's just focus on the main issue of the RNG for perf measurement by adding an enum to select a different RNG for perf measurement, and we can figure out how to improve the test data for verification at a later time.

And I have fixed all the important issues here: Verify Testing is better.

Like I said, increasing tolerance is not better.

Also, as a result of better tuning, Run time performance will be better!

Sure, so let's not debate the verify tests now, just update the data for tuning only, and use the old RNG for verification for now.

pfultz2 avatar Apr 29 '24 20:04 pfultz2

(Not wanting to debate; just adding some clarity for all, because of the broad statements below.)

somehow made to look good by choosing compromised data.

I see no problem with the test data, especially since it allows us to use stricter tolerances.

Like I said, increasing tolerance is not better.

The main issues addressed in this PR are threefold: catching bugs in code, in testing, or when doing a perf run. We caught a handful of bugs on all fronts!

The PR is not about increasing tolerances. There are 500+ tests with NO CHANGE to their existing tolerance, and they pass as before.

If we verify as before, we give a free pass to such bugs again! Perf is better. Testing is better!

lakhinderwalia avatar Apr 29 '24 22:04 lakhinderwalia

The main issues addressed in this PR are threefold:

Let's make it one-fold, and focus on the perf measurements.

catching bugs in code or in testing

There are no bug fixes in the PR, just updates to the tests.

The PR is not about increasing tolerances. There are 500+ tests with NO CHANGE to their existing tolerance, and they pass as before.

For the tests that fail, you just increase the tolerance instead of fixing the test. This could also be a problem for future tests we add, so this can grow further.

I have plans for fixing these in the future without needing to increase tolerances. The problem with higher tolerances is that when tests get fixed through a better verification apparatus, there is no error to signal that the tolerance could be tightened again later.

It's also good to have two different paths for generating numbers for verification and perf. All of the other component teams use different RNGs for perf and verification, so I think it's a good idea for us to do that as well. Using the same data can also cause problems for other improvements to verification in the future (such as ref using the highest precision for golden values, which may avoid overflow for very large values). So adding an enum is not something we will remove later.

pfultz2 avatar Apr 29 '24 23:04 pfultz2

In order for this PR to get merged, please follow Charlie's "less work, more clumsy" guidance. Along with that, please add a comment line in src/include/migraphx/generate.hpp explaining what the * 5 or 8 - 1 is. Comments in MIGraphX are lacking in general, and seeing a naked number with no explanation of why it is set that way causes confusion.

causten avatar May 06 '24 17:05 causten

Guys, This is something we should try to get right...

  1. Definitely we should be sampling more values than just 32 fixed possible numbers in the range of a datatype.
  2. Definitely the tolerance of mixed-precision operations should be dominated by the tolerance of the "smallest" datatype. Actually, the worst-case tolerance is the sum of the tolerances. In a statistical sense, the variances (= squared tolerances) would add.
  3. Definitely the expected tolerance of any operation depends on the number of operands (e.g. if you are adding or doing an mfma over many of them): the expected tolerance scales with the sqrt of the number of operations, while the absolute worst case scales linearly (a rough sketch follows this list).
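
A back-of-the-envelope version of points 2 and 3, assuming each primitive operation contributes an independent rounding error of relative magnitude at most ε for its datatype:

```latex
% Worst case for a length-n reduction: rounding errors add linearly.
\left| \operatorname{fl}\Bigl(\sum_{i=1}^{n} x_i\Bigr) - \sum_{i=1}^{n} x_i \right|
  \;\lesssim\; (n-1)\,\varepsilon \sum_{i=1}^{n} |x_i|

% Statistical expectation (independent, zero-mean rounding errors):
% variances add, so the expected error grows like sqrt(n).
\sigma_{n} \;\approx\; \sqrt{n}\,\varepsilon

% Mixed precision: variances (squared tolerances) add, and the
% lower-precision term dominates.
\sigma_{\text{mixed}}^{2} \;\approx\; \sigma_{\text{fp32}}^{2} + \sigma_{\text{fp16}}^{2}
  \;\approx\; \sigma_{\text{fp16}}^{2}
```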

This PR seems to address (1) fully, and (2) at least partly (better than the prior code), and that's a good thing. It doesn't seem to address (3); instead, we still seem to have those crazy-looking tolerances defined as integers. If those integers are defined so that they map to a 10^(-6) float tolerance, that should be implemented as such, i.e. we should calculate the integer that results in 10^-6...

So, overall, I think we should be doing more along the directions that this PR is starting... perhaps in ROCm 6.3? We should probably have a meeting about this and pull everybody's ideas into a plan...

hgaspar avatar Jun 10 '24 16:06 hgaspar

Definitely we should be sampling more values than just 32 fixed possible numbers in the range of a datatype.

We may want to tweak this, but for verification we want to do this so we can use stricter tolerances. Usually different RNGs are needed for verification and perf tuning; this is what CK and rocBLAS do as well, which is why I am asking to add an enum to control this.

Definitely the tolerance of mixed precision operations should be dominated by the tolerance of the "smallest" datatype.

Yes, that's how the driver does it, and we should do something similar in test_verify (but we should use stricter tolerances). We should also use allclose, because right now test_verify only checks the RMS error.

Also we should use double-precision for the ref version so we always get the most accurate result.

If those integers are defined so that they map to a 10^(-6) float tolerance, that should be implemented as such, i.e. we should calculate the integer that results in 10^-6...

Yes, the way we define it in the driver, we use something more like 10^-6, but there are three tolerances to be set, not just one.

pfultz2 avatar Jun 12 '24 15:06 pfultz2

Definitely we should be sampling more values than just 32 fixed possible numbers in the range of a datatype.

We may want to tweak this, but for verification we want to do this so we can use stricter tolerances. Usually different RNGs are needed for verification and perf tuning; this is what CK and rocBLAS do as well, which is why I am asking to add an enum to control this.

There are no different RNGs here. It is the same RNG. It's just that its output was previously getting compressed into some ultra-narrow, meaningless range of just 32 fixed numbers, while assuming that we are using strict tolerances! That assumption of strict tolerances didn't expose the underlying false positives! So this idea of flying with compromised test vectors, along with 'strict tolerances', is just deeply flawed.

This PR should not be gated on fixing all the underlying issues it exposes. We should fix the underlying issues as we go along. The mixed-precision cases have in reality nothing to do with this PR, except that they got exposed.

This PR should be accepted as is; there is no place in this code base for a 32-number-only test range! All the tweaks have been done in the simplest and most complete manner. Thanks.

lakhinderwalia avatar Jun 12 '24 16:06 lakhinderwalia