Instability with Version 41.0 on 3070 Ti if a GPU crashes

Open developeralgo8888 opened this issue 3 years ago • 3 comments

HiveOS: 0.6-217@220503 Nvidia Driver: 510.60.02 Kernel: 5.10.0-hiveos #110

Settings:

Seen on both EVGA and Zotac 3070 Ti LHR V2 cards:

Core_Clock = 900 MHz, Memory Clock = 2000 MHz, Power Limit = 0 (no power limit set), Temps core/mem = 39 / 87

No display or anything else is connected to the cards; it's a headless rig.

When an NVIDIA 3070 Ti crashes due to high OC values, NBMiner does not release or reset the power of that particular GPU. Instead, the GPU is left with an "Xorg" process running on it and the hashrate drops from 80 MH/s to 30 MH/s. You can verify the Xorg process with nvidia-smi -i "GPU-ID of crashed GPU".

If you restart NBMiner after lowering the OC values, it does not fix the issue. You have to reboot the rig, and only then can you use the new OC values. This seems to be a bug.

I have over 50 RTX 3070 Ti cards that work fine with the same OC settings, and I can adjust the settings with no issues. It's only when a GPU crashes that the power is not reset, due to the stuck "xorg" process.

Below is the output for one of the cards:

devbox@Walrus:/# nvidia-smi -i 4
Sun May 8 19:23:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   4  NVIDIA GeForce ...  On   | 00000000:06:00.0 Off |                  N/A |
|ERR!   35C    P2   186W / 310W |   5125MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A      3335      G   /usr/lib/xorg/Xorg                  6MiB |
|    4   N/A  N/A      6814      C   ./nbminer                        5115MiB |
+-----------------------------------------------------------------------------+
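For anyone who wants to script the check, here is a minimal Python sketch that reads the "Processes" table from nvidia-smi text output and lists graphics-type (G) processes such as Xorg on a given GPU. It assumes the one-row-per-line layout that driver 510 prints and process names without trailing oddities. The names `parse_processes` and `graphics_processes` are made up for illustration and are not part of NBMiner or nvidia-smi.

```python
import re

# One row of the nvidia-smi "Processes" table (layout assumed from driver 510):
# |  GPU  GI  CI  PID  Type  Process name  GPU Memory |
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+(?P<gi>\S+)\s+(?P<ci>\S+)\s+(?P<pid>\d+)"
    r"\s+(?P<type>[CG+]+)\s+(?P<name>.+?)\s+(?P<mem>\d+)MiB\s+\|$"
)


def parse_processes(smi_text):
    """Return one dict per process row found in nvidia-smi text output."""
    rows = []
    for line in smi_text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "gpu": int(m["gpu"]),
                "pid": int(m["pid"]),
                "type": m["type"],
                "name": m["name"],
                "mem_mib": int(m["mem"]),
            })
    return rows


def graphics_processes(rows, gpu):
    """Processes of type G (or C+G) on the given GPU index, e.g. Xorg."""
    return [r for r in rows if r["gpu"] == gpu and "G" in r["type"]]


# The two process rows from the report above, used as sample input.
SAMPLE = """\
|    4   N/A  N/A      3335      G   /usr/lib/xorg/Xorg                  6MiB |
|    4   N/A  N/A      6814      C   ./nbminer                        5115MiB |
"""

if __name__ == "__main__":
    for proc in graphics_processes(parse_processes(SAMPLE), gpu=4):
        print(proc["pid"], proc["name"], proc["mem_mib"], "MiB")
```

On a live rig the text would come from running nvidia-smi -i 4 and capturing its output. The sketch only reports the process; it does not try to kill Xorg or reset the GPU.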

Any idea why it's not resetting the power back to default and killing all the processes on the GPU when it crashes?

developeralgo8888, May 08 '22 23:05

After trying it for a day, I saw a similar issue. My graphics card is also a 3070 Ti, and I'm using Windows. I'm not sure what my true OC settings are after the hash rate drops, but I can confirm that resetting the OC in MSI Afterburner does not fix the hash rate problem.

Jiefei-Wang, May 09 '22 03:05

I have the same issue here. I'm using the recommended driver and also decreased the memclock by 100 and 200 MHz, but it still happened.

On rigs using 5 x 3060 GPUs, eventually one or two cards drop their hashrate to 17 MH/s.

=== GPU 1, 02:00.0 GeForce RTX 3060 12288 MB, PL: 100 W, 170 W, 170 W === 02:17:25
SET POWER LIMIT: 170.0 W [Unknown Error] (exitcode=123)
SET GPU CLOCKS: 1500 MHz
Max Perf mode: 4 (auto)
Attribute 'GPUGraphicsClockOffset' was already set to 0
ERROR: Error assigning value 2100 to attribute 'GPUMemoryTransferRateOffset' (E1-C1:0[gpu:1]) as specified in assignment '[gpu:1]/GPUMemoryTransferRateOffset[4]=2100' (Unknown Error).
ERROR: Error assigning value 100 to attribute 'GPUTargetFanSpeed' (E1-C1:0[fan:1]) as specified in assignment '[fan:1]/GPUTargetFanSpeed=100' (Unknown Error).
Attribute 'GPUFanControlState' (E1-C1:0[gpu:1]) assigned value 1. (exitcode=100)
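To spot which OC settings were rejected in a log like the one above, the failed assignments can be pulled out with a short Python sketch. The pattern is inferred only from the two ERROR lines in this log, so treat it as an assumption about the format. `failed_assignments` is a hypothetical helper name.

```python
import re

# Pattern inferred from the ERROR lines quoted in this thread (an assumption,
# not a documented log format).
ASSIGN_ERROR = re.compile(
    r"Error assigning value (?P<value>-?\d+) to attribute '(?P<attr>[^']+)'"
)


def failed_assignments(log_text):
    """Return (attribute, value) pairs for every failed assignment in the log."""
    return [(m["attr"], int(m["value"])) for m in ASSIGN_ERROR.finditer(log_text)]


# The two ERROR lines from the log above, used as sample input.
LOG = (
    "ERROR: Error assigning value 2100 to attribute "
    "'GPUMemoryTransferRateOffset' (E1-C1:0[gpu:1]) as specified in assignment "
    "'[gpu:1]/GPUMemoryTransferRateOffset[4]=2100' (Unknown Error).\n"
    "ERROR: Error assigning value 100 to attribute 'GPUTargetFanSpeed' "
    "(E1-C1:0[fan:1]) as specified in assignment "
    "'[fan:1]/GPUTargetFanSpeed=100' (Unknown Error).\n"
)

if __name__ == "__main__":
    for attr, value in failed_assignments(LOG):
        print(f"rejected: {attr} = {value}")
```

Because it searches the whole text rather than line by line, it works the same whether the log arrives as separate lines or as one run-together string.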

On 3080 Ti rigs, one or two cards also eventually drop from 118 to about 60 MH/s.

In all cases, only a reboot recovers the cards.
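Since the only symptom before a reboot is the hashrate drop, a watchdog could flag affected cards by comparing the current rate with a known-good baseline. The sketch below is a Python illustration using the figures reported in this thread. The 0.75 threshold and the name `is_degraded` are assumptions chosen for the example, not anything NBMiner or HiveOS provides.

```python
def is_degraded(baseline_mhs, current_mhs, min_ratio=0.75):
    """True if the current hashrate is below min_ratio of the baseline.

    min_ratio is an arbitrary illustrative threshold, not a recommended value.
    """
    if baseline_mhs <= 0:
        raise ValueError("baseline_mhs must be positive")
    return current_mhs < baseline_mhs * min_ratio


# (baseline, after-crash) hashrates in MH/s as reported in this thread.
REPORTED = {
    "3070 Ti": (80, 30),
    "3080 Ti": (118, 60),
}

if __name__ == "__main__":
    for card, (baseline, current) in REPORTED.items():
        state = "degraded" if is_degraded(baseline, current) else "ok"
        print(f"{card}: {current}/{baseline} MH/s -> {state}")
```

A flag like this could trigger an alert or a scheduled reboot, since restarting the miner alone reportedly does not recover the card.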

sabado, May 09 '22 06:05

I have the same problem with all my 3070 Ti cards with Micron memory.

trimbilrepo, May 25 '22 02:05