Mika Laitio

Results 129 comments of Mika Laitio

Bazel used to make it very time-consuming to get TensorFlow builds working, as it was very hard to do anything incrementally after fixing build errors...

Thanks, I heard that it should be possible to get a Linux VM with an EPYC CPU and an AMD GPU from the Azure cloud, which could be a good place to test this.

I will close this, as it's now more of a comment/documentation than a real pull request that is planned to be merged.

One thing to try is building the very latest kernel from git (6.11-rc4), as there are quite a few fixes for APUs in the latest kernel. I have...

Well, it's good to know that the fix is not there in the new kernel. Just to verify one other thing: once the GPU reset happens, the system is still able to reset...

Thanks, I agree. I have not really had much time to test this directly, except by building the 6.11-rc4 through rc6 kernels and the final kernel. Indirectly, I did some work on this by...

@jrl290 Thanks for the great test cases and traces. I think I now have a fix for this; your test case has now been running in a loop for several hundred rounds...

@jrl290 Attached is the new version of your test case. It's basically the same, with just small helper changes that do not modify your original logic. 1) #export HIP_VISIBLE_DEVICES="1" line to gpu_crash.sh to show...

@jrl290 Here is the link to the kernel fix. It took a while, as I tried a couple of different ways to fix it, but this was basically the only one I...

These should be easy steps: 1) git clone https://github.com/lamikr/linux.git 2) cd linux 3) git checkout release/rocm_612_gfx1102_fix 4) copy the kernel_build.sh and kernel_612_config files from the kernel_build_script.zip file above to the linux directory...