[Common][PyTorch][Rework] PDL for Quantization
Description
Rework PDL (Programmatic Dependent Launch) for quantization from #2001 and #2066.
Add two quantization configs (a rough kernel-side sketch follows the list):
- `pdl_sync`: Add `cudaGridDependencySynchronize` to the first quantization kernel, to make sure the previous (unknown) kernel has flushed its results to global memory. The following kernels are launched to the same stream and have no data dependency between them, so they don't need this sync.
- `pdl_trigger`: Add `cudaTriggerProgrammaticLaunchCompletion` to all but the last quantization kernel, so the last kernel won't trigger early launch of the next (unknown) kernel.
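As a hedged illustration only (not the actual kernels in this PR), the kernel-side effect of the two configs could look like the sketch below. The kernel name `quantize_kernel_pdl`, its parameters, and the copy body are assumptions, and the PDL device calls require an sm_90+ build.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only: the kernel name, parameters, and body are made up.
__global__ void quantize_kernel_pdl(const float* src, float* dst, int n,
                                    bool enable_sync, bool enable_trigger) {
  if (enable_sync) {
    // pdl_sync: wait until the preceding kernel's global-memory writes are visible.
    cudaGridDependencySynchronize();
  }

  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    dst[i] = src[i];  // placeholder for the real quantization math
  }

  if (enable_trigger) {
    // pdl_trigger: allow the next kernel in the stream to start launching early.
    cudaTriggerProgrammaticLaunchCompletion();
  }
}
```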
Type of change
- [ ] Documentation change (change only to the documentation, either a fix or a new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Infra/Build change
- [ ] Code refactoring
Changes
Please list the changes introduced in this PR:
- Add the `pdl_sync` and `pdl_trigger` quantization configs described above.
- Rework the PDL handling from #2001 and #2066 so the kernel-side sync/trigger calls and the host-side launch attribute are controlled separately.
Checklist:
- [ ] I have read and followed the contributing guidelines
- [ ] The functionality is complete
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
/te-ci
/te-ci
My understanding is that we have control of both edges between kernels: we can modify the launch of the current kernel with cudaLaunchKernelEx, and we can control whether the next kernel can launch early via cudaTriggerProgrammaticLaunchCompletion inside the current kernel. If this understanding is incorrect, lmk, thanks!
kernel1-->kernel2-->kernel3
This PR prevents the first kernel in a series of grouped quantize from launching with PDL. The current implementation blocks kernel1 from launching too early, before its data is ready, but since the launch and the triggering of the next kernel are controlled by the same enable_pdl flag, isn't it also preventing kernel2 from launching early, since there won't be a cudaTriggerProgrammaticLaunchCompletion() call in kernel1? So we don't get the benefit until kernel3.
On the other hand, kernel3 is okay to launch early, but do we want to prevent it from calling cudaTriggerProgrammaticLaunchCompletion() internally, so that any unknown subsequent kernels wouldn't launch early? If the subsequent kernels did launch early it'd be a bug in them too, but just to be safe we can avoid the final call to cudaTriggerProgrammaticLaunchCompletion() in this sequence of kernels (the call in kernel3) without any performance penalty, right?
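For concreteness, here is a minimal sketch of the host-side edge described above, assuming a toy `kernel2` and a made-up helper name; only the launch attribute is shown, the device-side trigger lives in the previous kernel.

```cuda
#include <cuda_runtime.h>

// Toy stand-in for kernel2 in the diagram above.
__global__ void kernel2(float* data) { /* real work would go here */ }

// Hypothetical helper: launch kernel2 with the PDL launch attribute set.
void launch_kernel2_with_pdl(float* data, cudaStream_t stream) {
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = dim3(1);
  cfg.blockDim = dim3(256);
  cfg.dynamicSmemBytes = 0;
  cfg.stream = stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;

  // kernel2 may begin launching while the previous kernel is still running,
  // but only after that kernel calls cudaTriggerProgrammaticLaunchCompletion()
  // (or exits); the attribute alone does not create the overlap.
  cudaLaunchKernelEx(&cfg, kernel2, data);
}
```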
I got your point. We need to handle the trigger inside the kernel and the launch attribute separately. That makes sense.
/te-ci
@jberchtold-nvidia Now I use two configs to control the behavior of PDL:
- `pdl_sync`: Add `cudaGridDependencySynchronize` to the first kernel, to make sure the previous (unknown) kernel has flushed its results to global memory. The following kernels are launched to the same stream and have no data dependency between them, so they don't need this sync.
- `pdl_trigger`: Add `cudaTriggerProgrammaticLaunchCompletion` to all but the last kernel, so the last kernel won't trigger early launch of the next (unknown) kernel.
On the host side, we always set the `cudaLaunchAttributeProgrammaticStreamSerialization` attribute. The behavior still depends on the kernel-side `cudaGridDependencySynchronize` / `cudaTriggerProgrammaticLaunchCompletion` calls. If there is neither a sync nor a trigger, the kernel behaves as a normal one without PDL. A host-side sketch of the scheme is below.
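This is a hedged sketch rather than this PR's actual code: it reuses the `quantize_kernel_pdl` signature from the sketch in the PR description above, and the helper name, pointer-array layout, and grid/block sizes are all assumptions.

```cuda
#include <cuda_runtime.h>

// Assumed kernel signature (defined as in the earlier sketch).
__global__ void quantize_kernel_pdl(const float* src, float* dst, int n,
                                    bool enable_sync, bool enable_trigger);

// Hypothetical dispatch for a grouped quantize of num_kernels tensors.
void launch_grouped_quantize(const float* const* srcs, float* const* dsts,
                             const int* sizes, int num_kernels,
                             bool pdl_sync, bool pdl_trigger,
                             cudaStream_t stream) {
  // The PDL launch attribute is always set on the host side.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  for (int i = 0; i < num_kernels; ++i) {
    cudaLaunchConfig_t cfg{};
    cfg.gridDim = dim3((sizes[i] + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream = stream;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    // Only the first kernel waits for the previous (unknown) kernel's writes,
    // and the last kernel withholds the trigger so the next (unknown) kernel
    // is not launched early. With both flags off, the kernels behave like
    // plain non-PDL kernels even though the attribute is set.
    const bool enable_sync = pdl_sync && (i == 0);
    const bool enable_trigger = pdl_trigger && (i + 1 < num_kernels);
    cudaLaunchKernelEx(&cfg, quantize_kernel_pdl,
                       srcs[i], dsts[i], sizes[i], enable_sync, enable_trigger);
  }
}
```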
Thanks! Using cudaGridDependencySynchronize seems like a good idea. The PR LGTM from my side, but I'll defer to Przemek or Tim for final approval since I'm not as familiar with using PDL. Thanks for the PR!
/te-ci
@yaox12 @timmoon10 - This PR has conflicts. I don't know whether the PR needs to be fixed or whether these are problems with the CI. Could you please say what the next step is to fix the CI issues?
I'm working on fixing the conflicts and CI issues.
/te-ci
/te-ci
/te-ci
Ready for review. The CI failures are unrelated to this PR.
Closing this PR because:
- We prefer to use a grouped quantize to further reduce the CPU overhead.
- There is parallel work on optimizing the quantization kernels, which makes the PDL changes hard to maintain.