
Running generated CUDA kernel outside of PyTorch

Open concretevitamin opened this issue 7 years ago • 9 comments

Hi,

I'm interested in running a TC-generated CUDA kernel outside of PyTorch. Currently, I'm using the TC options to specify grid and block dim3. E.g., with

    .mapToThreads(320)
    .mapToBlocks(32, 320)

from TC, I launch the auto-generated kernel (the __global__ function in /tmp/<tc>.cuda) with the following:

    dim3 grid(32, 320);
    dim3 block(320);
    tc_kernel<<<grid, block>>>( /* arguments with correct shapes; output buffer zeroed out */ );

However, this seems to produce incorrect values compared to a reference implementation. Am I missing anything? Is there other necessary setup for a TC kernel to work standalone?
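
For reference, the host side of my standalone run looks roughly like the sketch below. The kernel body is only a stand-in for the generated __global__ function from /tmp/<tc>.cuda (the real arguments and shapes differ), and the error checks are just there so a failed launch isn't mistaken for wrong output:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Stand-in for the TC-generated __global__ function; the real one lives in
    // /tmp/<tc>.cuda and takes the actual input/output buffers.
    __global__ void tc_kernel(float* out, const float* in, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * in[i];
    }

    #define CUDA_CHECK(expr)                                                  \
      do {                                                                    \
        cudaError_t err_ = (expr);                                            \
        if (err_ != cudaSuccess) {                                            \
          fprintf(stderr, "%s: %s\n", #expr, cudaGetErrorString(err_));       \
          exit(1);                                                            \
        }                                                                     \
      } while (0)

    int main() {
      const int n = 1 << 20;
      float *in = nullptr, *out = nullptr;
      CUDA_CHECK(cudaMalloc(&in, n * sizeof(float)));
      CUDA_CHECK(cudaMalloc(&out, n * sizeof(float)));
      CUDA_CHECK(cudaMemset(in, 0, n * sizeof(float)));
      CUDA_CHECK(cudaMemset(out, 0, n * sizeof(float)));  // zero out the output buffer

      dim3 grid(32, 320);  // taken from the mapping options above
      dim3 block(320);
      tc_kernel<<<grid, block>>>(out, in, n);

      CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
      CUDA_CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
      return 0;
    }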

concretevitamin avatar Jun 01 '18 14:06 concretevitamin

TC removes blocks and threads that do nothing, so .mapToThreads(320) does not mean the kernel will actually be launched with 320 threads per block. Moreover, the generated kernel code assumes it will be executed with a specific number of threads and blocks, which is what lets the mapper remove unnecessary guard conditions from the code.

Try tc.GlobalDebugInit(["--debug_tc_mapper=true", "--logtosdterr"]) and see what it outputs around the words "tightened launch bounds". These should be grid and block sizes effectively used for launching the kernel.
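
To illustrate (a toy sketch only, not code TC generates, and the sizes here are made up): once the mapper knows the exact launch size it can drop the guard conditions, so the same kernel launched with the bigger sizes you requested in the options will have the extra threads touch out-of-bounds memory:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy kernel in the style of code whose guards were removed because the
    // exact launch size is known: every launched thread is assumed to be valid.
    __global__ void copy_no_guard(float* out, const float* in) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      out[i] = in[i];
    }

    int main() {
      const int n = 64 * 256;  // the "tightened" size: 64 blocks of 256 threads
      float *in, *out;
      cudaMalloc(&in, n * sizeof(float));
      cudaMalloc(&out, n * sizeof(float));

      // Correct: launch with exactly the tightened bounds.
      copy_no_guard<<<64, 256>>>(out, in);

      // Wrong: launch with the sizes from the mapping options (e.g. 320 threads
      // per block); the extra threads read and write past the end of the buffers,
      // so the results (and neighbouring allocations) are undefined.
      copy_no_guard<<<64, 320>>>(out, in);

      cudaDeviceSynchronize();
      printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
      return 0;
    }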

ftynse avatar Jun 01 '18 14:06 ftynse

@ftynse Thanks. I'm using the conda-installed version of TC, commit git_version: "8e112e9dccda62c30ef29208a827e783b9a7f156" where --logtosdterr is not available. Is there a workaround? Fundamentally, is there a way to figure out the launch config from already-tuned <hash>.{cuda,options} files?

Also, an orthogonal question: let's say I had previously tuned a kernel with these cached output files:

/tmp/<hash>.cuda
/tmp/<hash>.options

If I want to start the autotuning process from this already-tuned kernel, do I pass layer.autotune(..., cache='/tmp/<hash>')? I'm seeing 100x worse "best" timing when I do this.

concretevitamin avatar Jun 02 '18 02:06 concretevitamin

@concretevitamin the commit you mention is pretty ancient; any chance you could build from source using the new build system (see the new build instructions)? That way you would have an up-to-date version of TC and get fixes as they come. If that is too inconvenient, you can also wait until we push a new TC conda package, though that will take a few more days.

Regarding the caching and iterating, we have been using that approach successfully from C++. There may be something lurking on the Python side that we missed, so a repro would always be useful. Note that we deprecated the cuda cache and only keep the topK best options (defaults to 10).

nicolasvasilache avatar Jun 04 '18 01:06 nicolasvasilache

@concretevitamin in particular, if you only want to use TC from Python and don't care about C++ dev or benchmarks, then #470 should be pretty easy to follow.

nicolasvasilache avatar Jun 04 '18 01:06 nicolasvasilache

where --logtosdterr is not available.

Well, I've made a typo and it should be --logtostderr.

Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested to look at the debug output had there been such a way.

ftynse avatar Jun 04 '18 06:06 ftynse


Fundamentally, is there a way to figure out the launch config from already-tuned .{cuda,options} files?

No. I would not have suggested to look at the debug output had there been such a way.

Hmm... isn't the point that we should store this information somewhere?


skimo-openhub avatar Jun 04 '18 08:06 skimo-openhub

Hmm... isn't the point that we should store this information somewhere?

If we had stored the generated code in the actual codebase, then the answer would have been yes. Codegen returns the launch bounds; now it's a matter of exposing the codegen call itself to Python. The caller can do whatever it wants with the results.

ftynse avatar Jun 04 '18 08:06 ftynse

@ftynse @nicolasvasilache I will give building from source a try.

Regarding whether or not the correct launch bounds should be stored on disk after auto-tuning: it seems obvious that they should be, otherwise how can one reuse the tuned kernels across sessions? An analogy I can think of is having successfully trained a NN but not storing the weights :)

concretevitamin avatar Jun 08 '18 03:06 concretevitamin

Well, this is not how the TC tuner was designed. It does not produce CUDA code, but mapping options. Storing the CUDA code is merely a side effect of running the kernel; I think we actually killed that storage completely in the master branch.

If you need the kernel and its launch-bounds description, give those options to the TC compiler and it will produce the desired result. The Python interface seems to be missing the proper call for this, which has to be addressed. Nothing more.

Picking up your analogy, the autotuner is more like comparing different NNs on test error: you keep the best architecture, but not necessarily the test set.

ftynse avatar Jun 08 '18 07:06 ftynse