
[Common][PyTorch][Rework] PDL for Quantization

Open yaox12 opened this pull request 4 months ago • 14 comments

Description

Rework PDL for quantization in #2001 and #2066.

Add two quantization configs

  • pdl_sync: Add cudaGridDependencySynchronize to the first quantization kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
  • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last quantization kernel, so that the last kernel does not trigger the next, unknown kernel early (see the sketch below).
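
For illustration, a minimal device-side sketch of how these two configs could map onto the PDL intrinsics; the kernel name, template flags, and body are assumptions, not the actual TransformerEngine code:

```cuda
// Sketch only: assumed kernel name and flags; intended for sm_90+ where PDL is supported.
#include <cuda_runtime.h>

template <bool kPdlSync, bool kPdlTrigger>
__global__ void quantize_kernel_sketch(const float* __restrict__ in,
                                       float* __restrict__ out, int n) {
  if constexpr (kPdlSync) {
    // pdl_sync: wait until the previous (unknown) kernel has flushed its
    // results to global memory before reading the input.
    cudaGridDependencySynchronize();
  }

  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i];  // placeholder for the real quantization math
  }

  if constexpr (kPdlTrigger) {
    // pdl_trigger: allow the next kernel in the stream to start launching.
    // Omitted in the last quantization kernel so the next (unknown) kernel
    // is not triggered early.
    cudaTriggerProgrammaticLaunchCompletion();
  }
}
```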

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • [ ] I have read and followed the contributing guidelines
  • [ ] The functionality is complete
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes

yaox12 avatar Sep 04 '25 05:09 yaox12

/te-ci

yaox12 avatar Sep 04 '25 05:09 yaox12

/te-ci

yaox12 avatar Sep 05 '25 01:09 yaox12

My understanding is that we have control of both edges between kernels: we can modify the launch of the current kernel with cudaLaunchKernelEx, and we can control whether the next kernel can launch early via cudaTriggerProgrammaticLaunchCompletion inside the current kernel. If this understanding is incorrect, lmk, thanks!

kernel1-->kernel2-->kernel3

This PR prevents the first kernel in a series of grouped quantize from launching with PDL. The current implementation blocks kernel1 from launching too early, before the data is ready, but since the launch and the triggering of the next kernel are controlled by the same enable_pdl flag, isn't it also preventing kernel2 from launching early, since there won't be a cudaTriggerProgrammaticLaunchCompletion() call in kernel1? So we don't get the benefit until kernel3.

On the other hand, kernel3 is okay to launch early, but do we want to prevent it from calling cudaTriggerProgrammaticLaunchCompletion() internally so that any unknown subsequent kernels won't launch early? If the subsequent kernels did launch early it would be a bug in them too, but just to be safe, we can avoid the final call to cudaTriggerProgrammaticLaunchCompletion() in this sequence of kernels (the call in kernel3) without any performance penalty, right?

jberchtold-nvidia avatar Sep 05 '25 17:09 jberchtold-nvidia

> My understanding is that we have control of both edges between kernels: we can modify the launch of the current kernel with cudaLaunchKernelEx, and we can control whether the next kernel can launch early via cudaTriggerProgrammaticLaunchCompletion inside the current kernel. If this understanding is incorrect, lmk, thanks!
>
> kernel1-->kernel2-->kernel3
>
> This PR prevents the first kernel in a series of grouped quantize from launching with PDL. The current implementation blocks kernel1 from launching too early, before the data is ready, but since the launch and the triggering of the next kernel are controlled by the same enable_pdl flag, isn't it also preventing kernel2 from launching early, since there won't be a cudaTriggerProgrammaticLaunchCompletion() call in kernel1? So we don't get the benefit until kernel3.
>
> On the other hand, kernel3 is okay to launch early, but do we want to prevent it from calling cudaTriggerProgrammaticLaunchCompletion() internally so that any unknown subsequent kernels won't launch early? If the subsequent kernels did launch early it would be a bug in them too, but just to be safe, we can avoid the final call to cudaTriggerProgrammaticLaunchCompletion() in this sequence of kernels (the call in kernel3) without any performance penalty, right?

I got your point. We need to handle the trigger from the kernel and the launch attribute separately. That makes sense.
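
As an illustration of that separation for a chain of three quantization kernels on the same stream (hypothetical kernel names, not the actual code), the calls would end up placed roughly like this, with all three kernels launched with the PDL attribute set on the host side:

```cuda
// Illustrative placement only (hypothetical kernel names).

__global__ void quantize_k1() {
  cudaGridDependencySynchronize();            // wait for the previous unknown kernel's stores
  /* ... quantize ... */
  cudaTriggerProgrammaticLaunchCompletion();  // let quantize_k2 start launching early
}

__global__ void quantize_k2() {
  /* ... quantize ... */                      // no sync: same stream, no dependency on k1's output
  cudaTriggerProgrammaticLaunchCompletion();  // let quantize_k3 start launching early
}

__global__ void quantize_k3() {
  /* ... quantize ... */                      // no trigger: don't release the next unknown kernel early
}
```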

yaox12 avatar Sep 08 '25 02:09 yaox12

/te-ci

yaox12 avatar Sep 08 '25 07:09 yaox12

@jberchtold-nvidia Now I use two configs to control the behavior of PDL:

  • pdl_sync: Add cudaGridDependencySynchronize to the first kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
  • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last kernel, so that the last kernel does not trigger the next, unknown kernel early.

On the host side, we always set the cudaLaunchAttributeProgrammaticStreamSerialization attribute. The behavior still depends on the kernel-side cudaGridDependencySynchronize / cudaTriggerProgrammaticLaunchCompletion calls. If there is neither a sync nor a trigger, the kernel behaves as a normal one without PDL.
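
A minimal host-side sketch of that launch path, with assumed function and kernel names rather than the actual TransformerEngine code:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the quantization kernels.
__global__ void quantize_kernel(const float* in, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // placeholder for the real quantization math
}

void launch_quantize_with_pdl_attr(const float* in, float* out, int n,
                                   cudaStream_t stream) {
  // Always set the PDL launch attribute on the host side.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim = dim3((n + 255) / 256);
  config.blockDim = dim3(256);
  config.stream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;

  // Whether PDL actually takes effect is decided inside the kernel: if the
  // kernel body neither calls cudaGridDependencySynchronize nor
  // cudaTriggerProgrammaticLaunchCompletion, this behaves like a normal launch.
  cudaLaunchKernelEx(&config, quantize_kernel, in, out, n);
}
```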

yaox12 avatar Sep 08 '25 07:09 yaox12

> @jberchtold-nvidia Now I use two configs to control the behavior of PDL:
>
>   • pdl_sync: Add cudaGridDependencySynchronize to the first kernel to make sure the previous, unknown kernel has flushed its results to global memory. The following kernels are launched on the same stream and have no data dependency between them, so they don't need this sync.
>   • pdl_trigger: Add cudaTriggerProgrammaticLaunchCompletion to all but the last kernel, so that the last kernel does not trigger the next, unknown kernel early.
>
> On the host side, we always set the cudaLaunchAttributeProgrammaticStreamSerialization attribute. The behavior still depends on the kernel-side cudaGridDependencySynchronize / cudaTriggerProgrammaticLaunchCompletion calls. If there is neither a sync nor a trigger, the kernel behaves as a normal one without PDL.

Thanks! Using cudaGridDependencySynchronize seems like a good idea. The PR LGTM from my side, but I'll defer to Przemek or Tim for final approval since I'm not as familiar with PDL. Thanks for the PR!

jberchtold-nvidia avatar Sep 08 '25 15:09 jberchtold-nvidia

/te-ci

yaox12 avatar Sep 22 '25 07:09 yaox12

@yaox12 @timmoon10 - This PR has conflicts. I don't know whether that's because the PR needs to be fixed or because of problems with the CI. Could you please say what the next step is here to fix the CI issues?

nvMelissa avatar Oct 15 '25 18:10 nvMelissa

> @yaox12 @timmoon10 - This PR has conflicts. I don't know whether that's because the PR needs to be fixed or because of problems with the CI. Could you please say what the next step is here to fix the CI issues?

I'm working on fixing the conflicts and CI issues.

yaox12 avatar Oct 16 '25 05:10 yaox12

/te-ci

yaox12 avatar Oct 16 '25 05:10 yaox12

/te-ci

yaox12 avatar Oct 17 '25 01:10 yaox12

/te-ci

yaox12 avatar Oct 21 '25 03:10 yaox12

Ready for review. The CI failures are unrelated to this PR.

yaox12 avatar Oct 21 '25 05:10 yaox12

Closing this PR because:

  1. We prefer to use a grouped quantize to further reduce the CPU overhead.
  2. There is parallel work on optimizing the quantization kernels, which makes the PDL changes hard to maintain.

yaox12 avatar Nov 20 '25 05:11 yaox12