Results: 28 comments by jamesxu2

Hi @yangyangv8, thanks for your patience. I was able to reproduce the PCIe atomics error on ROCm 5.7.3 and you can observe that the workaround of applying the target_compile_options(rccl...

Hi @hsadasiv or @preda, could you provide a reproducer for this issue? I don't see this showing up when inserting barrier(*) instructions and extracting temps with `hipcc --save-temps -g`....

Hi @preda, after some discussion with our LLVM internal team, this issue is already known and a fix is in the works, though it has not yet been merged into...

Hello @fwinter, thank you for the detailed reproduction info. I'm using your CMakeLists, build.sh and llvm_init.cc files and I'm unable to reproduce your issue on my system (AMD Ryzen...

Hello @fwinter, I've done some more testing on a Frontier-like system and have some recommendations: I am able to successfully compile and run your test program, with a more...

Hi @brownbat, I've tried reproducing your example > python3 researcher.py --num_samples 1000000 --num_layers 256 --batch_size 256 --embedding_dim 512 --hidden_dim 512 with three configurations: 1. ROCm 5.7.3 + Torch 2.4 (Nightly...

Hi @maxweiss, I was able to reproduce your issue and confirm this theory: > It looks like the function amd::smi::GetProcessInfoForPID returns too early when the files are missing in...

Hi @nonetrix, I tried this configuration and I wasn't able to reproduce your crash. Please have a look at my configuration and steps and tell me if there's something I'm...