[stdlib] Fix block-level sum and prefix sum operations
Block-level sum and prefix_sum are not currently implemented correctly. The problems are:
- Multiple places where execution deadlocks (warp_id() and lane_id() broadcast their values to all warp lanes, but only one of the lanes enters the block where these functions are called; as a result, the synchronization operations block execution because every thread other than the leader never reaches them). A minimal sketch of this failure mode follows below.
- Incorrect exclusive handling
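To illustrate the first problem, here is a minimal CUDA-flavored sketch (not the actual Mojo stdlib code; the helper names and WARPS_PER_BLOCK are made up for the illustration) of the divergent-barrier hang and of what the corrected shape looks like:

// CUDA sketch, assuming blockDim.x == WARP_SIZE * WARPS_PER_BLOCK.
constexpr int WARP_SIZE = 32;
constexpr int WARPS_PER_BLOCK = 4;

__device__ int warp_sum(int v) {
    // Butterfly reduction: afterwards every lane holds the warp total.
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// Broken shape: only lane 0 of each warp enters the branch that contains the
// barrier, so the other lanes never reach __syncthreads() and the block hangs.
__device__ int block_sum_broken(int v) {
    __shared__ int partial[WARPS_PER_BLOCK];
    int lane = threadIdx.x % WARP_SIZE;
    int warp = threadIdx.x / WARP_SIZE;
    int s = warp_sum(v);
    if (lane == 0) {
        partial[warp] = s;
        __syncthreads();  // divergent barrier: deadlocks the block
        int total = 0;
        for (int i = 0; i < WARPS_PER_BLOCK; ++i) total += partial[i];
        return total;
    }
    return 0;  // non-leader lanes never receive the result either
}

// Fixed shape: the per-warp write stays guarded, the barrier does not, so
// every thread participates in the synchronization and gets the block total.
__device__ int block_sum_fixed(int v) {
    __shared__ int partial[WARPS_PER_BLOCK];
    int lane = threadIdx.x % WARP_SIZE;
    int warp = threadIdx.x / WARP_SIZE;
    int s = warp_sum(v);
    if (lane == 0) partial[warp] = s;  // leader publishes the warp total
    __syncthreads();                   // all threads reach the barrier
    int total = 0;
    for (int i = 0; i < WARPS_PER_BLOCK; ++i) total += partial[i];
    return total;
}

The exclusive issue is orthogonal: an exclusive scan should report, for each thread, the sum of the elements before it, i.e. the inclusive result minus the thread's own contribution.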
This commit is a part of the project submission for "Modular GPU Kernel Hackathon" at the AGI House.
!sync
Thank you for tracking this down and fixing it Kirill!
@lattner Thank you very much! I wanted to add tests and make sure the changes are good so I could finally submit a "complete" patch, but I barely had enough time for the core algorithm implementation, so unfortunately that was out of reach.
But I'm still interested in finishing: writing tests is not that hard once the implementation is in place, and it would make this more of a "complete" patch.
Unfortunately, I couldn't find any instructions on how to build the project. Is there anywhere I can read how it's done?
I've looked at the new CI scripts, but I couldn't reproduce them (bazelw didn't work as expected). I know how to use Bazel and write rules, but I couldn't figure out how to run it properly. Granted, I was in a rush, but I'm curious whether there's a simple way to build the code and tests so that I could finish this?
We're /just/ getting build scripts in place - I expect it will come together with CI and good docs in the next few weeks. Here's an initial bazel doc that landed yesterday for example: https://github.com/modularml/modular/pull/61602/files
I see, thanks for the pointer! I think I did something similar yesterday but ran into errors. Those were likely due to system setup/missing dependencies etc., but I didn't have enough time to look into it, so maybe I can solve that. I'll give it another try tomorrow!
!sync
Hi @kirillbobyrev - thanks for the contribution, first of all! If you're up for finishing up this patch to add tests, here is how you can do so:
./bazelw test max/kernels/test/gpu/basics/test_prefix_sum.mojo.test
will use the new Bazel scripts @lattner mentioned, which we just brought up the other day, to build and run this particular test. I recommend adding/modifying the tests in that file (test_prefix_sum.mojo) to account for your fixes. If you're on top of tree/rebased, I suspect you'll run into one small issue:
./bazelw test max/kernels/test/gpu/basics/test_prefix_sum.mojo.test
INFO: Analyzed target //max/kernels/test/gpu/basics:test_prefix_sum.mojo.test (98 packages loaded, 6948 targets configured).
ERROR: /home/ubuntu/dev/modular-public/mojo/stdlib/stdlib/BUILD.bazel:12:13: //mojo/stdlib/stdlib:stdlib building mojo package failed: (Exit 1): mojo failed: error executing MojoPackage command (from target //mojo/stdlib/stdlib:stdlib) external/rules_mojo++mojo+mojo_toolchain_linux_x86_64/bin/mojo package '-strip-file-prefix=.' -o bazel-out/k8-fastbuild/bin/mojo/stdlib/stdlib/stdlib.mojopkg mojo/stdlib/stdlib
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
Included from "mojo/stdlib/stdlib/builtin/__init__.mojo":1:
mojo/stdlib/stdlib/builtin/uint.mojo:823:42: error: '@always_inline("builtin")' does not support MLIR operation index.ceildivu
return __mlir_op.`index.ceildivu`(self.value, denominator.value)
This is because right now the Bazel build hard-codes a specific version of the Mojo compiler to use, and the latest standard library requires a newer Mojo compiler in order to support the folding behavior of @always_inline("builtin") on ceildivu. See https://github.com/modular/modular/pull/4547 for an example that recently bumped the nightly Mojo compiler. @Ahajha is working on automating this as part of the nightly release process, and it will be absolutely necessary as we transition to using Bazel for both building and testing the Mojo standard library and the MAX Kernels Library.
Feel free to kick the tires with the Bazel build invocation I mentioned above and give feedback! You'd be our first non-Modular user of it as I haven't even announced it to the community yet as we're working through some things still (such as https://github.com/modular/modular/pull/4550).
@JoeLoser thank you for the detailed instructions!
I've dealt with the issue you were describing (just removed @always_inline("builtin") for ceildiv), but I'm getting another error, which looks to be Bazel build-related:
ubuntu@152-70-145-51:~/modular$ ./bazelw test --verbose_failures //max/kernels/test/gpu/basics:test_prefix_sum.mojo.test
WARNING: Build option --linkopt has changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed target //max/kernels/test/gpu/basics:test_prefix_sum.mojo.test (0 packages loaded, 6914 targets configured).
ERROR: /home/ubuntu/modular/max/kernels/test/gpu/basics/BUILD.bazel:45:14: Linking max/kernels/test/gpu/basics/test_prefix_sum.mojo.test failed: (Exit 1): cc_wrapper.sh failed: error executing CppLink command (from target //max/kernels/test/gpu/basics:test_prefix_sum.mojo.test)
(cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/a8246705211370f9a29d0c6e94adcf5a/sandbox/linux-sandbox/21/execroot/_main && \
exec env - \
PATH=/bin:/usr/bin:/usr/local/bin \
PWD=/proc/self/cwd \
ZERO_AR_DATE=1 \
external/toolchains_llvm++llvm+llvm_toolchain/bin/cc_wrapper.sh @bazel-out/k8-fastbuild/bin/max/kernels/test/gpu/basics/test_prefix_sum.mojo.test-0.params)
# Configuration: 64291322919f1be71fafefa4ff1a2ef983ab6ddd96fa8f249a314058786fefaa
# Execution platform: @@+_repo_rules4+modular_host_platform//:modular_host_platform
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
ld.lld: error: unable to find library -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Target //max/kernels/test/gpu/basics:test_prefix_sum.mojo.test failed to build
INFO: Elapsed time: 0.441s, Critical Path: 0.05s
INFO: 2 processes: 17 action cache hit, 2 internal.
ERROR: Build did NOT complete successfully
//max/kernels/test/gpu/basics:test_prefix_sum.mojo.test FAILED TO BUILD
I've looked at how my system is configured and it looks like libstdc++ is installed:
ubuntu@152-70-145-51:~/modular$ find / -name "libstdc++.so*" 2>/dev/null
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
/usr/lib/gcc/x86_64-linux-gnu/11/libstdc++.so
/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30-gdb.py
/snap/google-cloud-sdk/531/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/snap/google-cloud-sdk/531/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
/snap/google-cloud-sdk/531/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25-gdb.py
/snap/core20/2496/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/snap/core20/2496/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28
/snap/core20/2496/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28-gdb.py
/snap/core20/2571/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/snap/core20/2571/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28
/snap/core20/2571/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28-gdb.py
/snap/core22/1748/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/snap/core22/1748/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
/snap/core22/1748/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30-gdb.py
/snap/core22/1963/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/snap/core22/1963/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
/snap/core22/1963/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30-gdb.py
/opt/miniconda/lib/libstdc++.so.6.0.29
/opt/miniconda/lib/libstdc++.so.6
/opt/miniconda/lib/libstdc++.so
/opt/miniconda/pkgs/libstdcxx-ng-11.2.0-h1234567_1/lib/libstdc++.so.6.0.29
/opt/miniconda/pkgs/libstdcxx-ng-11.2.0-h1234567_1/lib/libstdc++.so.6
/opt/miniconda/pkgs/libstdcxx-ng-11.2.0-h1234567_1/lib/libstdc++.so
I'm running it on Ubuntu with an NVIDIA GPU that I rent from Lambda.ai:
ubuntu@152-70-145-51:~/modular$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy
I made sure that build-essential and g++ are installed, but the issue persists.
I've also tried compiling a simple hello_world.cpp with -lstdc++ and it worked, too, so I'm unsure why the Bazel build is not working. I've looked into the generated files, and it looks like it should be trying to find the system libstdc++, but is somehow failing to do so?
Am I doing something wrong?
Hi @kirillbobyrev thanks for working through this with us. It looks like this is actually a two-part issue 🙃 I'm working from public Ubuntu AMIs (both with and without GPU).
Starting with the non-GPU instance, it looks like libstdc++ is indeed missing. And you were right to install build-essential (which includes libstdc++-11-dev). That seemed to fix the issue for me.
Moving on to the GPU instance, it turns out that the NVIDIA packages install gcc 12, which causes clang to get confused and pick the wrong installation. Some logs from my testing:
$ external/toolchains_llvm++llvm+llvm_toolchain_llvm/bin/clang --verbose
clang version 20.1.1
...
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda, version 12.8
In this case, your machine is likely picking up the partial install of gcc 12 instead of the fixed gcc 11 install. I was able to work around it by explicitly fixing the gcc 12 installation:
sudo apt install libstdc++-12-dev
Please try that fix and let me know how it goes!
Thank you very much for the instructions! It works for me now :)
I've added tests which fail at HEAD and pass on my branch. I've also fixed another small issue: the warp-level sum claims in its docs to broadcast the result but doesn't do so in the code, so I now make sure the code broadcasts, too (a small sketch of that pattern follows the test command below).
I think that the patch is ready to go now, you can take a look and test, too. Thanks a lot for the guidance!
I've tested with
$ ./bazelw test \
max/kernels/test/gpu/basics/test_sum.mojo.test \
max/kernels/test/gpu/basics/test_prefix_sum.mojo.test
(sorry, I shouldn't really force-push, but I'm writing code and committing in the browser and then actually running the code on a Lambda instance which doesn't have my SSH keys or anything... I should probably get a more reasonable work setup)
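For context, here is a minimal CUDA-flavored sketch of that broadcast point (not the Mojo implementation, just the shape of the fix): with a shfl_down-style tree reduction only lane 0 ends up holding the warp total, so a final shuffle from lane 0 is needed if the function promises that every lane gets the result.

__device__ int warp_sum_broadcast(int v) {
    // Tree reduction: after the loop, lane 0 holds the full warp sum.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    // Broadcast lane 0's value so every lane returns the same total.
    return __shfl_sync(0xffffffff, v, 0);
}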
No worries about force pushing as things get squashed down into a single commit internally when we merge this. It's still easy to see the diffs between force pushes on GitHub these days 😄
I really appreciate you rounding out the patch to add tests and working with us on testing the Bazel build we're rolling out ❤️
Before I sync this internally, can you please also add a changelog entry and credit yourself? This is a great PR and I want to make sure others know about the algorithm and fixes 👍
Amazing work @kirillbobyrev !!!
Thanks @JoeLoser, I've added a changelog entry.
Thanks @abduld!
!sync
@kirillbobyrev I can't yet land this internally since one of the tests is failing on H100. It's passing internally on MI300 and A100 though. Here's a snippet from a run internally since our open source doesn't have access to H100. The error message leaves room for improvement as you can tell. If you happen to have access to an H100 and want to poke at this, feel free. Otherwise, I'll have someone internally take a look at fixing this so we can land it.
==================== Test output for //open-source/max/max/kernels/test/gpu/basics:test_prefix_sum.mojo.test:
CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)
================================================================================
Hmm, I see. Yeah, let me check it out on the weekend. I can rent an H100 for a few dollars on Lambda, so it should probably not be a problem.
@jiex-liu may be able to poke at it internally and see where the issue is as well if nothing obvious jumps out. Appreciate you bearing with us as we don't have GPU testing in our open source CI builds yet 😊
✅🟣 This contribution has been merged 🟣✅
Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the main branch during the next Mojo nightly release, typically within the next 24-48 hours.
We use Copybara to merge external contributions.
Landed in 3c01c8054654d02b1866dc0fb0b6bc4b5077a653! Thank you for your contribution 🎉