Caliper icon indicating copy to clipboard operation
Caliper copied to clipboard

Allowed Caliper to control Kokkos fencing behavior

Open DavidPoliakoff opened this issue 3 years ago • 37 comments

Many Kokkos PR's ago we added the ability for a tool to control the fencing behavior of Kokkos. In the case of Caliper, this could minimize overhead in one place currently, and more in the future.

What I do right now is just tell Kokkos only to fence if a callback subscribes to a given event. This could allow us in the future to do things like "only profile Regions" and not induce fences in parallel kernels. There's also room to do extremely clever things with GPU backends where we don't introduce syncs at all using CUPTI, but that's beyond the scope of what I want to do here.

Note that this does introduce some header files, which are extremely backwards compatible, we have about six years of history of these files not breaking anything

DavidPoliakoff avatar Jun 28 '21 15:06 DavidPoliakoff

Hold off on merging this one, there's something we ought to discuss

DavidPoliakoff avatar Jun 28 '21 15:06 DavidPoliakoff

I pinged you on an email, let me know what you think

DavidPoliakoff avatar Jun 28 '21 17:06 DavidPoliakoff

Oh good, you already implemented this, I thought I'd have to deal with the Kokkos fencing stuff now 😄

Do I interpret this correctly that you tell Kokkos "I don't need auto-fencing", but then tell it to fence explicitly in each of the callbacks? Generally Caliper shouldn't need fencing at all. For CUDA, Caliper can map GPU times to the regions that launched them, without any need for explicit fencing (you can give it a try with cuda-activity-report), and we should be able to do similar things for other programming models. The only thing that might need fencing currently would be the map-UVM-events-to-Kokkos-spaces analysis. So I think it would be better if Caliper didn't fence by default. Maybe make it optional?

TBH I still don't understand why the tool interface should have to concern itself with that, that should be a switch in Kokkos that the user can set. Oh well. And as Caliper shouldn't typically require fencing at all, I can't spontaneously think of any scenario where it would be beneficial to have more fine-grained control over it.

daboehme avatar Jun 28 '21 20:06 daboehme

Do I interpret this correctly that you tell Kokkos "I don't need auto-fencing", but then tell it to fence explicitly in each of the callbacks? Generally Caliper shouldn't need fencing at all. For CUDA, Caliper can map GPU times to the regions that launched them, without any need for explicit fencing (you can give it a try with cuda-activity-report), and we should be able to do similar things for other programming models. The only thing that might need fencing currently would be the map-UVM-events-to-Kokkos-spaces analysis. So I think it would be better if Caliper didn't fence by default. Maybe make it optional?

!!!!!!! Very cool, let me test that out. If this works, then yeah, delete fences, and get ready for Sandia users to beat down your door... Should be just CALI_CONFIG=cuda-activity-report(profile.kokkos)?

DavidPoliakoff avatar Jun 28 '21 20:06 DavidPoliakoff

I think it should be optional, in any case, due to the reasons I outlined in my email (you're not really just timing your own kernel, in a multi-stream case)

DavidPoliakoff avatar Jun 28 '21 20:06 DavidPoliakoff

Yeah, CALI_CONFIG="cuda-activity-report(profile.kokkos)" should work. Here's some more information about these reports: https://software.llnl.gov/Caliper/CUDA.html

daboehme avatar Jun 28 '21 20:06 daboehme

Okay, so I think I'll write the logic as follows:

  1. If we're using CUDA, and doing a cuda-activity-report, then no fencing is needed.
  2. Otherwise, do global fencing

And then as you gain the ability to do HIP fence-free profiling, we just keep adding to case 1

DavidPoliakoff avatar Jun 28 '21 21:06 DavidPoliakoff

Might be easier to just give the kokkos service a config flag that turns on fencing. Is there actually another situation besides the UVM event mapping where we really need it? Do people insist on synchronous times with e.g. a runtime-report?

daboehme avatar Jun 28 '21 21:06 daboehme

So, the problem is that if you don't do synchronous timings, you confuse your users. You claim that a dot-product on three elements took a minute, when in reality a kernel on another stream hogged the device for all but a few microseconds of that time. By the way, I was writing your side of this argument to Christian earlier today. But think about it, you write

call_huge_calculation_on_cuda(stream2);
call_small_dot_product(stream1);

Depending on how the scheduler works, you'll happily submit on stream2, then stream1, and then the big calculation hogs the device. But you've still started your timing on the kernel in stream1 (unless Caliper is doing something very interesting?). Your users write "why does Caliper think a dot product takes a minute" and then everybody is confused.

DavidPoliakoff avatar Jun 28 '21 21:06 DavidPoliakoff

Depending on how the scheduler works, you'll happily submit on stream2, then stream1, and then the big calculation hogs the device. But you've still started your timing on the kernel in stream1 (unless Caliper is doing something very interesting?)

Timing via Cupti isn't like timing on a CPU. You don't explicitly start/stop anything. Each kernel submission in CUPTI has a unique ID so as long as Caliper associates the Kokkos string with that ID before Kokkos launches another kernel, Caliper (or any tool using Cupti) is capable of resolving your theoretical scenario

jrmadsen avatar Jun 28 '21 23:06 jrmadsen

First: hi! Hope you're enjoying AMD

Second: It can resolve it, but the timings won't necessarily be accurate in the sense of "how long does my dot kernel take to run," it might answer "how long does my dot kernel take to get scheduled, possibly delayed by other kernels, and then eventually run." Which is not at all a useless answer, but it's also not always the answer a user will want

DavidPoliakoff avatar Jun 28 '21 23:06 DavidPoliakoff

First: hi! Hope you're enjoying AMD

Had my first day today. I haven't sent out an email yet (only bc of procrastination) so I suspect this news to @daboehme 🤣

timings won't necessarily be accurate in the sense of "how long does my dot kernel take to run,"

What I'm saying is that it will be accurate for how long the kernel takes to run bc that's exactly what a cupti activity record provides: the runtime for the kernel, absent everything else. What you are describing with

"how long does my dot kernel take to get scheduled, possibly delayed by other kernels, and then eventually run."

is what you would get if you used cudaEvent timings, not cupti

jrmadsen avatar Jun 28 '21 23:06 jrmadsen

Ah! That's actually news to me. Cool, in that case I don't see a reason to ever fence. And congrats on the first day, and getting to surprise Boehme, that's always a good day.

DavidPoliakoff avatar Jun 28 '21 23:06 DavidPoliakoff

First: hi! Hope you're enjoying AMD

Had my first day today. I haven't sent out an email yet (only bc of procrastination) so I suspect this news to @daboehme

It is! Well, congratulations :smile:

Ah! That's actually news to me. Cool, in that case I don't see a reason to ever fence.

Exactly! Well, unless you do weird stuff like the UVM event to kokkos space mapping. Otherwise I think all the artificial fencing is probably more harmful than beneficial for understanding performance.

daboehme avatar Jun 29 '21 00:06 daboehme

Path                                             Host Time GPU Time GPU %
cudaFreeHost                                      0.000039
test_two                                          3.329040 1.532707  46.040502
  Kokkos::Tools::Experim~~: Tool Requested Fence  1.671173
    cudaDeviceSynchronize                         1.026064
  cudaLaunchKernel                                0.614629 1.532707 249.371064
  cudaFuncSetCacheConfig                          0.000005
  cudaFuncGetAttributes                           0.000007
test_one                                          3.272529 1.533013  46.844918
  Kokkos::Tools::Experim~~: Tool Requested Fence  1.669054
    cudaDeviceSynchronize                         1.022471
  cudaLaunchKernel                                0.557666 1.533013 274.898070
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000008
Kokkos::View::initialization [v2]                 0.000113 0.000030  26.590754
  Kokkos::Tools::Experim~~: Tool Requested Fence  0.000025
    cudaDeviceSynchronize                         0.000012
  Kokkos::Impl::ViewValu~~setting values in view  0.000030
    cudaStreamSynchronize                         0.000021
  cudaLaunchKernel                                0.000017 0.000030 173.327938
  cudaPointerGetAttributes                        0.000006
Kokkos::View::initialization [v1]                 0.000182 0.000031  17.277801
  Kokkos::Tools::Experim~~: Tool Requested Fence  0.000024
    cudaDeviceSynchronize                         0.000011
  Kokkos::Impl::ViewValu~~setting values in view  0.000029
    cudaStreamSynchronize                         0.000018
  cudaLaunchKernel                                0.000016 0.000030 189.246237
  cudaMemcpyToSymbol                              0.000017 0.000001   6.528709
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000013
  cudaPointerGetAttributes                        0.000014
Kokkos::Tools::Experime~~e: Tool Requested Fence  2.237524
  cudaDeviceSynchronize                           0.955436
cudaDeviceSetCacheConfig                          0.000013
cudaMemset                                        0.000046
cudaHostAlloc                                     0.000026
cudaFree                                          0.000563
cudaLaunchKernel                                  0.000048 0.000006  13.201375
cudaMemcpy                                        0.000488 0.000041   8.321780
cudaMalloc                                        0.000409
Kokkos::CudaInternal::i~~on space initialization  0.000057
  cudaDeviceSynchronize                           0.000036
cudaSetDevice                                     0.000016
cudaStreamCreate                                  0.000694
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
(REBUILD WITHOUT FENCES)
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN      ] defaultdevicetype.development_test
[       OK ] defaultdevicetype.development_test (3125 ms)
[----------] 1 test from defaultdevicetype (3125 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3125 ms total)
[  PASSED  ] 1 test.
Path                                             Host Time GPU Time GPU %
cudaFreeHost                                      0.000037
test_two                                          1.293494 1.391292 107.560713
  cudaLaunchKernel                                0.833609 1.391292 166.899775
  cudaFuncSetCacheConfig                          0.000008
  cudaFuncGetAttributes                           0.000007
test_one                                          1.229367 1.391136 113.158774
  cudaLaunchKernel                                0.769919 1.391136 180.686070
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000008

Nifty result. Actually cool in both cases, Caliper giving you some visibility into what Kokkos does under the hood. I'll refine that

DavidPoliakoff avatar Jun 29 '21 20:06 DavidPoliakoff

Actually, here's the cool result (with some refinement:

(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="cuda-activity-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN      ] defaultdevicetype.development_test
[       OK ] defaultdevicetype.development_test (3117 ms)
[----------] 1 test from defaultdevicetype (3118 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3118 ms total)
[  PASSED  ] 1 test.
Path                                             Host Time GPU Time GPU %
cudaFreeHost                                      0.000038
test_two                                          1.263474 1.391103 110.101493
  cudaLaunchKernel                                0.781364 1.391103 178.035232
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000008
test_one                                          1.225919 1.391138 113.477202
  cudaLaunchKernel                                0.738680 1.391138 188.327482
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000008
Kokkos::View::initialization [v2]                 0.000077 0.000030  39.004349
  Kokkos::Impl::ViewValu~~setting values in view  0.000021
    cudaStreamSynchronize                         0.000011
  cudaLaunchKernel                                0.000026 0.000030 116.596818
  cudaPointerGetAttributes                        0.000006
Kokkos::View::initialization [v1]                 0.000161 0.000032  19.537230
  Kokkos::Impl::ViewValu~~setting values in view  0.000028
    cudaStreamSynchronize                         0.000016
  cudaLaunchKernel                                0.000018 0.000030 171.277255
  cudaMemcpyToSymbol                              0.000022 0.000001   5.140208
  cudaFuncSetCacheConfig                          0.000006
  cudaFuncGetAttributes                           0.000013
  cudaPointerGetAttributes                        0.000015
cudaDeviceSetCacheConfig                          0.000012
cudaMemset                                        0.000048
cudaHostAlloc                                     0.000026
cudaFree                                          0.000417
cudaLaunchKernel                                  0.000048 0.000006  12.817001
cudaMemcpy                                        0.028340 0.000039   0.138658
cudaMalloc                                        0.000368
Kokkos::CudaInternal::i~~on space initialization  0.000088
  cudaDeviceSynchronize                           0.000048
cudaSetDevice                                     0.000016
cudaStreamCreate                                  0.000647
(base) [dzpolia@kokkos-dev-2 cuda-build]$ ./core/unit_test/KokkosCore_UnitTest_Develop --kokkos-tools-library=$HOME/src/caliper/nkb/src/libcaliper.so --kokkos-tools-args="runtime-report(profile.kokkos)"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from defaultdevicetype
[ RUN      ] defaultdevicetype.development_test
[       OK ] defaultdevicetype.development_test (6608 ms)
[----------] 1 test from defaultdevicetype (6608 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (6608 ms total)
[  PASSED  ] 1 test.
Path                                             Time (E) Time (I) Time % (E) Time % (I)
test_two                                         0.864759 2.130433  13.085650  32.237999
  Kokkos::Tools::Experim~~: Tool Requested Fence 1.265674 1.265674  19.152349  19.152349
test_one                                         0.871407 2.135599  13.186248  32.316171
  Kokkos::Tools::Experim~~: Tool Requested Fence 1.264192 1.264192  19.129923  19.129923
Kokkos::View::initialization [v2]                0.000031 0.000072   0.000469   0.001090
  Kokkos::Tools::Experim~~: Tool Requested Fence 0.000010 0.000010   0.000151   0.000151
  Kokkos::Impl::ViewValu~~setting values in view 0.000031 0.000031   0.000469   0.000469
Kokkos::View::initialization [v1]                0.000076 0.000117   0.001150   0.001770
  Kokkos::Tools::Experim~~: Tool Requested Fence 0.000011 0.000011   0.000166   0.000166
  Kokkos::Impl::ViewValu~~setting values in view 0.000030 0.000030   0.000454   0.000454
Kokkos::Tools::Experime~~e: Tool Requested Fence 0.828718 0.828718  12.540272  12.540272
Kokkos::CudaInternal::i~~on space initialization 0.000028 0.000028   0.000424   0.000424

Note how Caliper reports the fences it induces. So in the runtime-report case, it needs fences. In the cuda-activity-report case, it doesn't. What's fun here is that as you develop HIP capability to do the same, I just add "hip-activity-report" to the list of things that don't require fencing, but until you do, people get accurate results with runtime-report. And they see how much time is coming from tool induced fencing, if they want to adjust for that. This is a very neat thing. I'll push code

DavidPoliakoff avatar Jun 29 '21 22:06 DavidPoliakoff

@daboehme also note the total test runtime. This is a pathological example, but we're cutting runtime in half on the test

DavidPoliakoff avatar Jun 29 '21 22:06 DavidPoliakoff

Is this ready to merge now @DavidPoliakoff ?

daboehme avatar Jul 01 '21 19:07 daboehme

In my book it is, if you think it's good to go, merge it

DavidPoliakoff avatar Jul 02 '21 14:07 DavidPoliakoff

Ping

DavidPoliakoff avatar Jul 13 '21 14:07 DavidPoliakoff

Oh, sorry for the delay. I took a closer look at this again - it seems the only thing that disables the fencing now is if cuda-activity-report is active? Seems a bit weak - I think it would be good to have an explicit switch as well, like a CALI_KOKKOS_ENABLE_FENCING config variable. Still fine setting the defaults appropriately based on which config we detect.

daboehme avatar Jul 13 '21 17:07 daboehme

Why turn it off, though? If I'm not doing cuda-activity-report, the answers I get will be garbage, no?

DavidPoliakoff avatar Jul 13 '21 18:07 DavidPoliakoff

Why turn it off, though? If I'm not doing cuda-activity-report, the answers I get will be garbage, no?

There are other configs (cuda-activity-profile) which also work for asynchronous tasks. Or you can have custom configs.

daboehme avatar Jul 13 '21 20:07 daboehme

What else works for asynchronous tasks? I basically want to limit it to those, as people will use the custom config case, and not know what they're doing, and ask me why Kokkos/Caliper says their kernels only take 30 microseconds

DavidPoliakoff avatar Jul 13 '21 20:07 DavidPoliakoff

There's the cuda-activity-profile (in addition to -report), whatever we'll do for HIP, etc. Plus maybe you want to see the asynchronous times. I'm just saying it would be nice to be more configurable for people who do know what they're doing.

daboehme avatar Jul 13 '21 20:07 daboehme

I'd rather maintain a list. The asynchronous times are completely meaningless. Ratio of "people who will be curious to see kernel times in a non-asynchrony-aware case" to "people who will ask me why the config their buddy gave them that says the kernel they know takes 1 second takes exactly the length of a CUDA/HIP kernel launch" is way too high for me to want to expose that option. I'll ask in the group about this, though.

DavidPoliakoff avatar Jul 13 '21 20:07 DavidPoliakoff

An aggressive proposal might be for configs to say what resources they require fences of (this is independently useful information). Then Kokkos could just query "ah, does the requested config need fencing of [resource]?"

DavidPoliakoff avatar Jul 13 '21 20:07 DavidPoliakoff

An aggressive proposal might be for configs to say what resources they require fences of (this is independently useful information). Then Kokkos could just query "ah, does the requested config need fencing of [resource]?"

Yeah, but that would introduce additional requirements in services that don't necessarily have much to do with kokkos. Unfortunately there is not really a reliable automatic way to determine if you need fencing or not, you just have to encode that somewhere. It can be left on by default if you think it confuses users otherwise.

daboehme avatar Jul 13 '21 21:07 daboehme

It actually provides a lot of value outside of Kokkos. If a user wants to know "what are the configurations that allow me to get asynchronous timings, and CUDA, and produce a trace" for example, they currently look in docs, perhaps? But if configs documented their capabilities, we could allow people to query these kinds of things.

Just a thought, not one I'm pushing too hard. What I am pushing very hard against is making this an independent configuration setting. I think it is a very, very, very bad idea, that will become one of the main sources of questions I get about why Caliper is doing something unintuitive. If you tell me "this needs to be an independent setting" I'll do it, but I think it's a bad move to separate this from a list of configs, that buys us nothing but bug reports we'll hate to debug.

Edit to add: I don't mean that last aggressively, I just want to highlight how dangerous I think the idea is

DavidPoliakoff avatar Jul 13 '21 21:07 DavidPoliakoff

If you don't want a global config variable, maybe just make turning it off explicitly available via the kokkosp_parse_args callback so that it's available to the very few users that want it but doesn't add unnecessary clutter to the global configuration.

jrmadsen avatar Jul 14 '21 02:07 jrmadsen