jamesxu2
Hi @yangyangv8, thanks for your patience. I was able to reproduce the PCIe atomics error on ROCm 5.7.3 and you can observe that the workaround of applying the target_compile_options(rccl...
Hi @yangyangv8, do you have any updates?
Hi @hsadasiv or @preda, could you provide a reproducer for this issue? I don't see this showing up when inserting barrier(*) instructions and extracting temps with `hipcc --save-temps -g`....
Hi @preda, after some discussion with our internal LLVM team, this issue is already known and a fix is in the works, though it has not yet been merged into...
Hello @fwinter, thank you for the detailed reproduction info. I'm using your CMakeLists, build.sh and llvm_init.cc files and I'm unable to reproduce your issue on my system (AMD Ryzen...
Hello @fwinter, I've done some more testing on a Frontier-like system and have some recommendations: I am able to successfully compile and run your test program, with a more...
Hi @fwinter, do you have any update?
Hi @brownbat, I've tried reproducing your example `python3 researcher.py --num_samples 1000000 --num_layers 256 --batch_size 256 --embedding_dim 512 --hidden_dim 512` with three configurations: 1. ROCm 5.7.3 + Torch 2.4 (Nightly...
Hi @maxweiss, I was able to reproduce your issue and confirm this theory: > It looks like the function `amd::smi::GetProcessInfoForPID` returns too early when the files are missing in...
Hi @nonetrix, I tried this configuration and I wasn't able to reproduce your crash. Please have a look at my configuration and steps and tell me if there's something I'm...