John Pellman
Well, the syscall where PyTorch is getting messed up is in our _strace_ output already:

```
ioctl(5, _IOC(0, 0x00, 0x19, 0x00)
```

_ioctl_ is used for performing I/O calls outside...
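To make the `_IOC(...)` notation in that strace line a bit more concrete, here is a minimal sketch of how the four fields pack into a single request number. It assumes the generic Linux layout from `asm-generic/ioctl.h` (8 bits nr, 8 bits type, 14 bits size, 2 bits direction), which is what x86 uses; some other architectures differ. The helper names `ioc_encode` and `ioc_decode` are made up for illustration and are not part of any library.

```python
# Field widths and shifts, assuming the generic Linux layout (asm-generic/ioctl.h).
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS, _IOC_DIRBITS = 8, 8, 14, 2
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS      # 8
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS  # 16
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS   # 30


def ioc_encode(direction, type_, nr, size):
    """Pack (dir, type, nr, size) into a request number, in strace's argument order."""
    return (
        (direction << _IOC_DIRSHIFT)
        | (type_ << _IOC_TYPESHIFT)
        | (nr << _IOC_NRSHIFT)
        | (size << _IOC_SIZESHIFT)
    )


def ioc_decode(request):
    """Unpack a request number into a (dir, type, nr, size) tuple."""
    direction = (request >> _IOC_DIRSHIFT) & ((1 << _IOC_DIRBITS) - 1)
    type_ = (request >> _IOC_TYPESHIFT) & ((1 << _IOC_TYPEBITS) - 1)
    nr = (request >> _IOC_NRSHIFT) & ((1 << _IOC_NRBITS) - 1)
    size = (request >> _IOC_SIZESHIFT) & ((1 << _IOC_SIZEBITS) - 1)
    return direction, type_, nr, size


# The request from the strace line: no direction, type 0x00, nr 0x19, size 0.
request = ioc_encode(0, 0x00, 0x19, 0x00)
print(hex(request), ioc_decode(request))
```

Under that layout the request is just the command number 0x19 with no direction or size bits set. The encoding alone does not say what the driver does with that command.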
As yet another bit of info, I ran [memtestG80](https://github.com/ihaque/memtestG80) on each of the GPUs on my system. Pretty much all GPUs were fine except for the first one (index 0)....
I ran pdb on one of the consistently unkillable PyTorch programs. The point at which the script became unkillable was when it ran something similar to `model.to(device)`. The function in...
A couple other tidbits about my configuration:

* IOMMU is not on, virtualization is disabled, both in software and hardware.
* I'm able to replicate the issue by just running...
I just did. You can find it [here](https://developer.nvidia.com/nvidia_bug/2712582) (NVBug 2712582).
I've received no updates from Nvidia.
I think the best workaround will be to fix #10657 :)
Interesting. I had come across the Supermicro post [here](https://www.supermicro.com/support/faqs/faq.cfm?faq=20732) describing that potential issue, but when I ran `lspci -vvv | grep ACSCtrl` I didn't get any output. I assumed that...
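For anyone who wants more context than a bare `grep` gives, here is a rough sketch that pairs each `ACSCtrl` line with the device it belongs to. It assumes the usual `lspci -vvv` text layout (device header at column 0, capability lines indented beneath it); `find_acs_ctrl` and `acs_enabled` are hypothetical helper names, and the sample text is invented purely to exercise the parser.

```python
import re


def find_acs_ctrl(lspci_text):
    """Map each device header line to its ACSCtrl flags string, if it has one."""
    results = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device section.
            current = line.strip()
        elif current is not None:
            match = re.search(r"ACSCtrl:\s*(.*)", line)
            if match:
                results[current] = match.group(1).strip()
    return results


def acs_enabled(flags):
    """Return the names of flags marked '+' (enabled) in an ACSCtrl flags string."""
    return [flag[:-1] for flag in flags.split() if flag.endswith("+")]


# Invented sample in the shape of lspci -vvv output, not taken from a real machine.
SAMPLE = (
    "00:01.0 PCI bridge: Example bridge\n"
    "\tCapabilities: [100 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+\n"
    "\t\tACSCtrl:\tSrcValid+ TransBlk- ReqRedir+\n"
    "01:00.0 VGA compatible controller: Example GPU\n"
    "\tCapabilities: [60] Power Management\n"
)

for device, flags in find_acs_ctrl(SAMPLE).items():
    print(device, "->", acs_enabled(flags))
```

To run it against a real system, the text could come from something like `subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout`. As far as I know, lspci hides most capability details from non-root users (it prints `<access denied>`), so an empty result without root may not by itself show that ACS is absent.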
You also might want to read this thread: https://news.ycombinator.com/item?id=10961462