
[Common][PyTorch][Rework] PDL for Quantization

Open yaox12 opened this pull request 4 months ago • 14 comments

Description

Rework PDL for quantization in #2001 and #2066.

Add two quantization configs

  • pdl_sync: Add cudaGridDependencySynchronize to the first quantization kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
  • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last quantization kernel, so that the last kernel does not trigger the next, unknown kernel early (see the sketch below).
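
For illustration, a minimal device-side sketch of how these two configs could map onto the PDL intrinsics; the kernel name, template flags, and body are assumptions, not the actual TransformerEngine code:

```cuda
// Sketch only: assumed kernel name and flags; intended for sm_90+ where PDL is supported.
#include <cuda_runtime.h>

template <bool kPdlSync, bool kPdlTrigger>
__global__ void quantize_kernel_sketch(const float* __restrict__ in,
                                       float* __restrict__ out, int n) {
  if constexpr (kPdlSync) {
    // pdl_sync: wait until the previous (unknown) kernel has flushed its
    // results to global memory before reading the input.
    cudaGridDependencySynchronize();
  }

  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i];  // placeholder for the real quantization math
  }

  if constexpr (kPdlTrigger) {
    // pdl_trigger: allow the next kernel in the stream to start launching.
    // Omitted in the last quantization kernel so the next (unknown) kernel
    // is not triggered early.
    cudaTriggerProgrammaticLaunchCompletion();
  }
}
```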

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • [ ] I have read and followed the contributing guidelines
  • [ ] The functionality is complete
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes

yaox12 avatar Sep 04 '25 05:09 yaox12

/te-ci

yaox12 avatar Sep 04 '25 05:09 yaox12

/te-ci

yaox12 avatar Sep 05 '25 01:09 yaox12

My understanding is that we have control of both edges between kernels: we can modify the launch of the current kernel with cudaLaunchKernelEx, and we can control whether the next kernel can launch early via cudaTriggerProgrammaticLaunchCompletion inside the current kernel. If this understanding is incorrect, lmk, thanks!

kernel1-->kernel2-->kernel3

This PR prevents the first kernel in a series of grouped quantize from launching with PDL. The current implementation blocks kernel1 from launching too early, before the data is ready, but since the launch and the triggering of the next kernel are controlled by the same enable_pdl flag, isn't it also preventing kernel2 from launching early, since there won't be a cudaTriggerProgrammaticLaunchCompletion() call in kernel1? So we don't get the benefit until kernel3.

On the other hand, kernel3 is okay to launch early, but do we want to prevent it from calling cudaTriggerProgrammaticLaunchCompletion() internally so that any unknown subsequent kernels won't launch early? If the subsequent kernels did launch early it would be a bug in them too, but just to be safe, we can avoid the final call to cudaTriggerProgrammaticLaunchCompletion() in this sequence of kernels (the call in kernel3) without any performance penalty, right?

jberchtold-nvidia avatar Sep 05 '25 17:09 jberchtold-nvidia

> My understanding is that we have control of both edges between kernels: we can modify the launch of the current kernel with cudaLaunchKernelEx, and we can control whether the next kernel can launch early via cudaTriggerProgrammaticLaunchCompletion inside the current kernel. If this understanding is incorrect, lmk, thanks!
>
> kernel1-->kernel2-->kernel3
>
> This PR prevents the first kernel in a series of grouped quantize from launching with PDL. The current implementation blocks kernel1 from launching too early, before the data is ready, but since the launch and the triggering of the next kernel are controlled by the same enable_pdl flag, isn't it also preventing kernel2 from launching early, since there won't be a cudaTriggerProgrammaticLaunchCompletion() call in kernel1? So we don't get the benefit until kernel3.
>
> On the other hand, kernel3 is okay to launch early, but do we want to prevent it from calling cudaTriggerProgrammaticLaunchCompletion() internally so that any unknown subsequent kernels won't launch early? If the subsequent kernels did launch early it would be a bug in them too, but just to be safe, we can avoid the final call to cudaTriggerProgrammaticLaunchCompletion() in this sequence of kernels (the call in kernel3) without any performance penalty, right?

I got your point. We need to handle the trigger from the kernel and the launch attribute separately. That makes sense.
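
As an illustration of that separation for a chain of three quantization kernels on the same stream (hypothetical kernel names, not the actual code), the calls would end up placed roughly like this, with all three kernels launched with the PDL attribute set on the host side:

```cuda
// Illustrative placement only (hypothetical kernel names).

__global__ void quantize_k1() {
  cudaGridDependencySynchronize();            // wait for the previous unknown kernel's stores
  /* ... quantize ... */
  cudaTriggerProgrammaticLaunchCompletion();  // let quantize_k2 start launching early
}

__global__ void quantize_k2() {
  /* ... quantize ... */                      // no sync: same stream, no dependency on k1's output
  cudaTriggerProgrammaticLaunchCompletion();  // let quantize_k3 start launching early
}

__global__ void quantize_k3() {
  /* ... quantize ... */                      // no trigger: don't release the next unknown kernel early
}
```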

yaox12 avatar Sep 08 '25 02:09 yaox12

/te-ci

yaox12 avatar Sep 08 '25 07:09 yaox12

@jberchtold-nvidia Now I use two configs to control the behavior of PDL:

  • pdl_sync: Add cudaGridDependencySynchronize to the first kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
  • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last kernel, so that the last kernel does not trigger the next, unknown kernel early.

On the host side, we always set the cudaLaunchAttributeProgrammaticStreamSerialization attribute. The behavior still depends on the kernel-side cudaGridDependencySynchronize / cudaTriggerProgrammaticLaunchCompletion calls. If there is neither a sync nor a trigger, the kernel behaves as a normal one without PDL.
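
A minimal host-side sketch of that launch path, with assumed function and kernel names rather than the actual TransformerEngine code:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the quantization kernels.
__global__ void quantize_kernel(const float* in, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // placeholder for the real quantization math
}

void launch_quantize_with_pdl_attr(const float* in, float* out, int n,
                                   cudaStream_t stream) {
  // Always set the PDL launch attribute on the host side.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim = dim3((n + 255) / 256);
  config.blockDim = dim3(256);
  config.stream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;

  // Whether PDL actually takes effect is decided inside the kernel: if the
  // kernel body neither calls cudaGridDependencySynchronize nor
  // cudaTriggerProgrammaticLaunchCompletion, this behaves like a normal launch.
  cudaLaunchKernelEx(&config, quantize_kernel, in, out, n);
}
```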

yaox12 avatar Sep 08 '25 07:09 yaox12

> @jberchtold-nvidia Now I use two configs to control the behavior of PDL:
>
>   • pdl_sync: Add cudaGridDependencySynchronize to the first kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
>   • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last kernel, so that the last kernel does not trigger the next, unknown kernel early.
>
> On the host side, we always set the cudaLaunchAttributeProgrammaticStreamSerialization attribute. The behavior still depends on the kernel-side cudaGridDependencySynchronize / cudaTriggerProgrammaticLaunchCompletion calls. If there is neither a sync nor a trigger, the kernel behaves as a normal one without PDL.

Thanks! Using cudaGridDependencySynchronize seems like a good idea. The PR LGTM from my side, but I'll defer to Przemek or Tim for final approval since I'm not as familiar with PDL. Thanks for the PR!

jberchtold-nvidia avatar Sep 08 '25 15:09 jberchtold-nvidia

/te-ci

yaox12 avatar Sep 22 '25 07:09 yaox12

@yaox12 @timmoon10 - This PR has conflicts. I don't know whether that's because the PR needs to be fixed or because of problems with the CI. Could you please say what the next step is here to fix the CI issues?

nvMelissa avatar Oct 15 '25 18:10 nvMelissa

> @yaox12 @timmoon10 - This PR has conflicts. I don't know whether that's because the PR needs to be fixed or because of problems with the CI. Could you please say what the next step is here to fix the CI issues?

I'm working on fixing the conflicts and CI issues.

yaox12 avatar Oct 16 '25 05:10 yaox12

/te-ci

yaox12 avatar Oct 16 '25 05:10 yaox12

/te-ci

yaox12 avatar Oct 17 '25 01:10 yaox12

/te-ci

yaox12 avatar Oct 21 '25 03:10 yaox12

Ready for review. The CI failures are unrelated to this PR.

yaox12 avatar Oct 21 '25 05:10 yaox12

Closing this PR because:

  1. We prefer to use a grouped quantize to further reduce the CPU overhead.
  2. There is parallel work on optimizing the quantization kernels, which makes the PDL changes hard to maintain.

yaox12 avatar Nov 20 '25 05:11 yaox12