GPU stops and miner freezes.

Open joaogoldrocha opened this issue 7 years ago • 42 comments

Hello,

I've been having issues with my rig: out of the blue, the xmr-stak-amd miner hangs due to a GPU failure. My question is: would it be possible to have the miner automatically disable the faulty GPU and notify the owner somehow, while it keeps doing its thing?

It would be quite nice to have such a feature in this great software, and I'm sure the community would appreciate it :).

Thanks

joaogoldrocha avatar Sep 18 '17 11:09 joaogoldrocha

The question is why the miner freezes. Did you overclock your GPU?

It is not good practice to ignore errors; it would be much better if the miner stopped when something goes wrong. Then an external script could restart the miner or handle the broken GPU. All in all, we need to find out why the miner freezes and solve the issue.
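
As a rough illustration of the "external script" idea, here is a minimal Python sketch. It assumes the miner actually exits with a non-zero code on failure, which the freezes reported in this thread do not do, so it is only half of a solution. The function name `supervise` and the `./xmr-stak-amd` path in the usage comment are placeholders, not part of xmr-stak.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, delay_s=0.0):
    """Run cmd and relaunch it each time it exits with a non-zero code.

    Returns the list of exit codes observed, one per launch.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.call(cmd)  # blocks until the process exits
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown, so do not restart
        time.sleep(delay_s)  # short pause before relaunching
    return exit_codes


# Hypothetical usage (placeholder path, adjust to your install):
#   supervise(["./xmr-stak-amd"], max_restarts=5, delay_s=10.0)
```

A bounded restart count is used here so that a GPU that fails immediately on every launch does not produce an endless restart loop.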

psychocrypt avatar Sep 18 '17 11:09 psychocrypt

In this specific case no, there's no overclock, but I still think you misunderstood the point.

I would love to have something that notifies me that there's an issue in the rig without stopping the whole miner. It would keep going and let me know there's a problem.

joaogoldrocha avatar Sep 18 '17 11:09 joaogoldrocha

This would mix two orthogonal tasks within the miner, which is not good practice. The miner should never crash on a healthy system. If it does, there is a bug, but the miner itself cannot check for unknown bugs. Monitoring a system is a task for dedicated software like Nagios or Ganglia. If you are using Centreon, for example, you can write your own test for the health of the miner or system and configure SMS, mail, or other notifications.

Add a test so that you get notified if the load of the GPU or CPU is too low.
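
The core of such a low-load test could look roughly like the Python sketch below. How the load samples are collected (driver tool, monitoring agent) is left open, and the function name, the 50% threshold, and the three-sample window are all illustrative assumptions rather than recommended values.

```python
def load_too_low(samples, threshold_pct=50.0, min_consecutive=3):
    """Return True if the last min_consecutive load samples are all below threshold_pct.

    samples is a list of load percentages (0-100), oldest first.
    """
    if len(samples) < min_consecutive:
        return False  # not enough data to decide yet
    recent = samples[-min_consecutive:]
    return all(s < threshold_pct for s in recent)
```

Requiring several consecutive low samples avoids alerting on a single short dip, for example while the miner reconnects to the pool.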

psychocrypt avatar Sep 18 '17 11:09 psychocrypt

I had the same issue. My HD 7950 freezes while "affine_to_cpu" is false in the config, especially when the monitor goes to sleep or Windows 10 switches night vision. Software like GPU-Z or the voltage monitoring in MSI Afterburner also causes freezes, and it doesn't depend on overclocking.

applicate2628 avatar Sep 19 '17 21:09 applicate2628

Is an error shown on the terminal?

psychocrypt avatar Sep 20 '17 05:09 psychocrypt

No errors; it just stops mining after lags. Occasionally a BSOD occurs with a THREAD_STUCK_IN_DEVICE_DRIVER error.

applicate2628 avatar Sep 20 '17 06:09 applicate2628

I'm getting similar crashes where WattMan throws an error in Windows and mining stops because of it. No overclock is applied, and temperatures seem to be normal whenever it happens.

Is there anything I can provide to help troubleshoot?

eMadman avatar Sep 20 '17 11:09 eMadman

I'm actually seeing an issue where the miner stays active (no GPU errors), but it simply stops hashing/communicating altogether. It seems to happen regularly every six hours or so. If I close the miner and reopen, it usually starts again.

I've activated logging, but the log only shows what was displayed in the console. I'll see if verbose logging gives me any additional info.

jonsully avatar Sep 20 '17 14:09 jonsully

@jonsully - are you seeing a wattman error in your task tray around the time that happens? XMR stack was showing a normal hashrate, but the pool and CPU-Z showed my card was idle. Ended up going overnight without any mining activity even though it was showing ~400H/s. I'll try logging after work tomorrow and share my findings as well.

eMadman avatar Sep 20 '17 14:09 eMadman

@eMadman Actually I had the opposite. Hashrate on the pool was 0, GPUs do not go idle or throw any errors. The console stops updating and becomes unresponsive. No Wattman errors are occurring as when I restart xmr-stack all cards are hashing at normal rates.

jonsully avatar Sep 20 '17 15:09 jonsully

@jonsully I have the exact same issue. After around 6 hours the rate just drops and the program seems frozen, doesn't respond to input and nothing is being output. No errors showing. Restarting the program resumes activity like normal
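
Since the process stays alive in this failure mode, a watchdog that only waits for the miner to exit would never fire. One possible heuristic, sketched in Python below, is to treat a log file that has stopped changing as a sign of a hang. This assumes the miner is configured to write to a log file that normally updates regularly; the function name and the 300-second limit are placeholders, and this is not a feature of xmr-stak itself.

```python
import os
import time


def log_is_stale(log_path, max_age_s=300.0, now=None):
    """Return True if log_path has not been modified within max_age_s seconds."""
    if now is None:
        now = time.time()
    # getmtime returns the last modification time as seconds since the epoch
    return (now - os.path.getmtime(log_path)) > max_age_s
```

What to do once the log is stale is a separate question: several reports in this thread say the hung process cannot be killed, so a notification or a full reboot may be the only options.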

vebjornr avatar Sep 23 '17 20:09 vebjornr

Is your card overclocked?

psychocrypt avatar Sep 24 '17 06:09 psychocrypt

I have the exact same issue. After about 1 to 4 hours the rate just drops and the program seems frozen: it doesn't respond to input and nothing is being output. No errors are shown. Restarting the program resumes activity like normal. The GPUs keep drawing power from the wall as if they are mining and stay hot, but no shares are being submitted and the pool shows 0h/s. This is happening on both of my miners: an MSI Z170A Gaming M5 with 6x RX 480 and a Biostar TB250 BTC PRO with 6x RX 470. All of my GPUs have a modded BIOS but are NOT overclocked.

I should also mention that I am running Ubuntu 16.04.3 LTS and was experiencing this issue with wolf-xmr-miner-0.4 as well. I switched to xmr-stak-amd and am still experiencing the same issues. The only miner that has never given me any problems is Claymore XMR GPU miner 0.95-0.97 on Windows; it works perfectly, non-stop, 24/7. The only problem with Windows is that it's too power hungry, which is why I am trying to mine on Linux. As of now, though, even wolf-xmr-miner is working better for me than xmr-stak-amd.

ghost avatar Sep 25 '17 14:09 ghost

My R280x is not overclocked and I've gone through the XMR-STAK logs as well as windows event viewer. I can't find any events between the two that would indicate a source. XMR's logs aren't verbose enough, and windows only shows me a message in logs after the video card becomes unresponsive.

I've noticed that my card is hovering around 80c when mining with XMR and I'm starting to think the card is crashing itself to prevent overheating rather than throttling itself. I've experienced crashes during extended gaming sessions when the card is pushed to its very limits for too long.

eMadman avatar Sep 25 '17 16:09 eMadman

I have this same issue with my GPUs. When ONE GPU freezes, all of xmr-stak-amd freezes, and I can't stop it, even by killing the process, and Ubuntu won't even restart. To avoid this problem I am running each GPU in a separate xmr-stak-amd instance, so only one GPU freezes (I still can't reboot my machine all the way). Claymore's miner avoids this problem by restarting the GPU that freezes, and the miner just keeps working. Maybe xmr-stak-amd should do the same, or at least stop only the faulty GPU and keep mining with the rest.

MalMen avatar Oct 07 '17 14:10 MalMen

I experience the same when I overclock the memory on my Vega 56s. I would like the miner to exit when a GPU faults instead of freezing. I am using Docker to manage miners, and it can restart the containers when they stop.

calvintam236 avatar Oct 19 '17 04:10 calvintam236

I got a similar issue. The miner works fine for a couple of hours and suddenly goes unresponsive. The mining pool shows 0 Mh/s. Power consumption is still high (meaning the cards are working at full load). Ctrl+C does not work, so I have to kill the process. Restarting the miner doesn't work either; I have to reset or shut down the system. No clue what goes wrong. Any debugging steps would be appreciated.

rijujohnx avatar Nov 13 '17 02:11 rijujohnx

This is hopefully fixed in the next release, but the cause can also be an undervolted and overclocked GPU.

psychocrypt avatar Nov 13 '17 06:11 psychocrypt

@psychocrypt sounds good. However, there is no debugging info whatsoever in the miner. Some sort of debug logs would make troubleshooting much easier.

The GPUs are overclocked and undervolted. However, they run stable for weeks while mining Ethereum. If it is the overclocking/undervolting, it might be that the GPUs run in a different state (a lower P-state, since power consumption is lower than with Eth mining: 900 W vs 1250 W), which I didn't pay much attention to while modding the BIOS and which may be unstable. This is just conjecture anyway. Without debugging logs it's very difficult to narrow down.

EDIT: It was due to the undervolt and the low TDP/TDC of the modded BIOS. I tuned it and it has been stable for 3 days without a hang.

rijujohnx avatar Nov 13 '17 15:11 rijujohnx

Monero mining and Eth mining are different. This means a system that is stable for Eth need not be stable during Monero mining. Please set everything to default to check whether it is the miner or the changed clock and voltage. Overclocked memory without ECC must be considered unstable: one bit flip can produce an endless loop on the device.

psychocrypt avatar Nov 13 '17 16:11 psychocrypt

I have a freezing problem with all xmr-stak software on my Windows 10 machine. Nvidia GTX 560, AMD RX 580, Intel i5 6400: all three programs get stuck at some point until I press a key to resume them. The new all-in-one xmr-stak does the same.

EDIT: I found that the properties of the CMD window cause this freeze for me. I turned off "Quick Edit Mode" and "Insert Mode" and haven't had this issue for months.

pecuna avatar Nov 21 '17 16:11 pecuna

Because of this issue, I switched to xmrig-amd.

calvintam236 avatar Nov 21 '17 16:11 calvintam236

I can confirm the issue but with a higher occurrence rate.

I run on an ASRock H110 Pro BTC with 12x R9 290 GPUs. The xmr-stak UI freezes within 10 minutes after launch, and the GPUs won't restart after the miner is killed and relaunched. The machine won't shut down; I have to manually actuate the power switch.

The same setup with only 6 cards was stable for 7 hours yesterday.

NicolBol avatar Nov 22 '17 09:11 NicolBol

I've installed the latest release on Ubuntu 16.04 with an AMD A4 and an Nvidia 1050. It runs fine until I stop the command line tool. When I stop the tool, it freezes the system and I have to push the reset button.

brmmm3 avatar Nov 28 '17 20:11 brmmm3

Possibly a riser issue. I have two mining rigs, one with 8 GPUs and the other with 4. The 8-GPU rig has never stopped until I quit the mining application. The 4-GPU rig had the same issue mentioned above. It has 3x RX Vega 56 and 1x RX 580, all overclocked. When I encountered this issue, I checked Radeon's Global WattMan settings and noticed that the RX 580 had stopped working. I reset its overclock settings, but it was still freezing after a few minutes or hours (randomly). I thought it might be a GPU issue, because all of the overclocked RX Vegas were working fine. Finally (I don't remember where I read it), I decided to replace the riser of the RX 580. Now it has been working for 2 days non-stop, overclocked. Prefer new-generation risers.

ocalozyavuz avatar Dec 27 '17 01:12 ocalozyavuz

So I was pulling my hair out with this issue. I applied a 100 MHz overclock, which solved the problem. Keep in mind I overclocked memory by 650 MHz. But this helped even though I wasn't overclocking memory. It's been going solid now for the last 2 days.

Fredz1 avatar Dec 30 '17 11:12 Fredz1

I'm having similar problems here with a 6 GPU rig, a mix of RX 580's and RX 550's - they are all bios-modded but not overclocked. Built on windows from commit d015a3d on the dev branch.

No logs in the console at all, just a frozen miner.

--edit 2018-01-13-- Removed one of the cards, which gave stable mining for about 12 hours; then the miner froze again. Upon killing the miner, the entire Windows machine crashed. Anyway, it does seem like a hardware problem and not software.

nover avatar Jan 12 '18 08:01 nover

Same issue on cast-xmr, running 3x Vega 56 (with 64 BIOS) and 4x Vega 64. I got these two rigs stable(ish) enough to run for 3-4 days with 0.7-1% errors due to expired blocks, which is absolutely fine.

However, as many people have mentioned already, I have the same issue with one of the cards freezing, which freezes the whole PC. This happens a lot more often with the Vega 64s, and it is usually the same card that gives issues. I've been playing with overclocking, which does seem to improve things or make them worse; you have to find the "right" overclocks for each card. Already tried:

- changing risers: doesn't help
- installing the Aug 23 drivers with all cards plugged in, and separately: doesn't help
- following almost every tutorial there is on "how to get 2000h/s": the only difference I found was that one of the modders had a stable registry file, which gave lower temps and lower wattage while staying in the 1750-1950h/s range; otherwise none of the tutorials helped with the stability issue

It looks like the problem might be only with some cards. After undervolting, the flashed Vega 56s seem to run stable for 5-7 days before freezing, while the Vega 64 rig still has one card which I'm unable to "fix".

I doubt anyone has a proper fix for our problems, but I thought I would just put this out, since literally everyone is dealing with this issue regardless of what miner/system/hardware they use.

slapenke avatar Jan 14 '18 11:01 slapenke

I had the same issue with my newly built rig for over a week. Luckily my problem was with the risers: I was using a cheap 1x-to-16x riser that I bought on eBay. Now I am using a "PCI-E 1X TO 16X GPU Mining Extender Riser Multi-interface Adapter W/ LED", which you can find at this link: https://www.ebay.com/itm/6-Pack-PCI-E-1X-TO-16X-GPU-Mining-Extender-Riser-Multi-interface-Adapter-W-LED/172982585253?hash=item284690c7a5:g:eQgAAOSw3RZaOqft
My rig has now been up for 2 days, rock solid! (I am using 5x 1080 Ti on an MSI Z270-A PRO motherboard.)

abdoomaster avatar Jan 23 '18 03:01 abdoomaster

I have the same issue with the miner freezing (no errors, no logs), on a 270x, directly connected to the motherboard. Don't know if it's a hardware or software issue, but I can see in the windows logs that the driver stopped responding.

If I run furmark or other tools, the system is stable. Temperature while mining does not exceed 65 degrees.

Any ideas on what to check further?

aproapeom avatar Jan 24 '18 19:01 aproapeom