
[Bug] GRResult error occurring a couple of times a day farming a few PB of C2 compressed plots using an Nvidia P4 GPU - Bladebit

Open chain-enterprises opened this issue 1 year ago • 98 comments

What happened?

When the system (ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots) hits a high IO load on the same block device as the Chia full node DB, the Chia debug.log will shortly afterwards show a GRResult not OK error. The number of plots, lookup times, etc. all seem fine, but the harvester stops finding proofs until it is restarted. Happens 1-2 times in a 24 hour period on Alpha 4 through Alpha 4.3.

Whenever the error occurs, block validation time and lookup time consistently increase leading up to the error being thrown.

Reproducible with Nvidia Unix GPU Driver versions 530.30.03, 530.41.03, and 535.43.02

Version

2.0.0b3.dev56

What platform are you using?

Ubuntu 22.04, Linux kernel 5.15.0-73-generic, ProLiant DL360 Gen9, dual E5-2620 v4, 32 GB RAM, Nvidia P4, 75k C2 plots

What UI mode are you using?

CLI

Relevant log output

2023-05-29T20:45:32.552 full_node chia.full_node.mempool_manager: WARNING  pre_validate_spendbundle took 2.0414 seconds for xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2023-05-29T20:45:42.620 full_node chia.full_node.mempool_manager: WARNING  add_spendbundle xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx took 10.06 seconds. Cost: 2924758101 (26.589% of max block cost)
2023-05-29T20:45:56.840 full_node chia.full_node.full_node: WARNING  Block validation time: 2.82 seconds, pre_validation time: 2.81 seconds, cost: None header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732042
2023-05-29T20:46:57.239 full_node chia.full_node.full_node: WARNING  Block validation time: 3.34 seconds, pre_validation time: 0.42 seconds, cost: 3165259860, percent full: 28.775% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732044
2023-05-29T20:49:26.913 full_node chia.full_node.full_node: WARNING  Block validation time: 2.40 seconds, pre_validation time: 0.49 seconds, cost: 2041855544, percent full: 18.562% header_hash: 8d0ce076a3270a0c8c9c8d1f0e73c9b5b884618ee34020d2a4f3ffafa459cfd0 height: 3732055
2023-05-29T20:51:06.259 full_node full_node_server        : WARNING  Banning 89.58.33.71 for 10 seconds
2023-05-29T20:51:06.260 full_node full_node_server        : WARNING  Invalid handshake with peer. Maybe the peer is running old software.
2023-05-29T20:51:27.986 harvester chia.harvester.harvester: ERROR    Exception fetching full proof for /media/chia/hdd23/plot-k32-c02-2023-04-23-someplot.plot. GRResult is not GRResult_OK.
2023-05-29T20:51:28.025 harvester chia.harvester.harvester: ERROR    File: /media/chia/hdd23/someplot.plot Plot ID: someplotID, challenge: 7b5b6f11ec2a86a7298cb55b7db8a016a775efea221104b37905366b49f2e2bd, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x7f3544998f30>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: contractHash>, plot_public_key=<G1Element PlotPubKey>, file_size=92374601728, time_modified=1682261996.8218756)
2023-05-29T20:51:57.482 full_node chia.full_node.full_node: WARNING  Block validation time: 10.23 seconds, pre_validation time: 0.29 seconds, cost: 959315244, percent full: 8.721% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732059
2023-05-29T20:55:24.640 full_node chia.full_node.full_node: WARNING  Block validation time: 3.18 seconds, pre_validation time: 0.26 seconds, cost: 2282149756, percent full: 20.747% header_hash: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx height: 3732067
2023-05-29T20:56:01.825 wallet wallet_server              : WARNING  Banning 95.54.100.118 for 10 seconds
2023-05-29T20:56:01.827 wallet wallet_server              : ERROR    Exception Invalid version: '1.6.2-sweet', exception Stack: Traceback (most recent call last):
  File "chia/server/server.py", line 483, in start_client
  File "chia/server/ws_connection.py", line 222, in perform_handshake
  File "packaging/version.py", line 198, in __init__
packaging.version.InvalidVersion: Invalid version: '1.6.2-sweet'

chain-enterprises avatar May 30 '23 21:05 chain-enterprises

Still happening with the Linux Nvidia beta GPU driver v535.43.02.

As soon as the following GPU error occurred, the GRResult error was thrown in the Chia debug.log:

[Tue May 30 19:49:17 2023] NVRM: Xid (PCI:0000:08:00): 31, pid=459359, name=start_harvester, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f53_718af000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

which led to this debug.log error:

2023-05-30T19:49:18.260 harvester chia.harvester.harvester: ERROR Exception fetching full proof for /media/chia/hdd142/plot-k32-c02-2023-04-26-06-20-xxxxxxxxxxxxx.plot. GRResult is not GRResult_OK

chain-enterprises avatar May 31 '23 04:05 chain-enterprises

Me too... Windows 10, GUI. GPU device: (image)

Error log like this: (image) ...and no proofs any more.

liyujcx avatar May 31 '23 12:05 liyujcx

Another "me too"

Dell R630 server, dual E5-2620v4 CPUs, 64GB RAM, Debian 11.7, Tesla P4 with 530.30.02 drivers.

ab0tj avatar May 31 '23 23:05 ab0tj

Ref issue https://github.com/Chia-Network/chia-blockchain/issues/15470

This isn't limited to a few times a day. I switched to a pool to test proofs and got flooded with these errors on each partial until I fell back to CPU harvesting.

reythia avatar Jun 08 '23 21:06 reythia

Same issue here !

Ubuntu 22.04 / kernel 5.15.0-73, Driver Version: 530.30.02, CUDA Version: 12.1, dual E5-2680 v4, 256 GB 2133 MHz RAM, Tesla P4

Plots: around 9,000 C7

GRResult error in chia log + nvidia FAULT_PDE ACCESS_TYPE_READ in kernel log

Happens randomly; worst case 2 hours, best case 20 hours without an error.

jinglenode avatar Jun 13 '23 16:06 jinglenode

I am facing the same GRResult issue. My details are:

Win 10, GTX 1060 with 535.98 / CUDA 12.2, E5-2690 v4, 64 GB RAM. Currently 2428 C7 plots, and increasing. Using the Chia GUI.

The issue has happened twice in the last two days. Restarting the GUI fixed it.

prodchia avatar Jun 13 '23 16:06 prodchia

I am able to consistently reproduce this error on a Windows-based system by using the Disable Device option in the display driver properties menu, waiting a few seconds, and then re-enabling the device with the same button. The GRResult issue will then appear in the logs.

thesemaphoreslim avatar Jun 22 '23 19:06 thesemaphoreslim

I am also affected by this.

Running a dedicated harvester (separate from the full_node and farmer) on a BTC-T37 board with a Tesla P4 GPU and an LSI 9102 (SAS2116) HBA. Both the HBA and the GPU are attached via PCIe x1 Gen2. Ubuntu 22.04 is running on a Celeron 1037U CPU with 4GB DDR3 RAM.

My harvester node is version 2.0.0b3.dev116 (bladebit alpha 4.3), obtained via the Chia Discord. Tried bladebit alpha 4.4, but that does not work at all. Farming 4280 C7 plots (bladebit) and some 300 uncompressed NFT plots.

Edit: In my opinion this should produce an error message in the logs, maybe even a critical one, but should not stop the harvester from working.

javanaut-de avatar Jul 03 '23 07:07 javanaut-de

This issue has not been updated in 14 days and is now flagged as stale. If this issue is still affecting you and in need of further review, please comment on it with an update to keep it from auto closing in 7 days.

github-actions[bot] avatar Jul 17 '23 11:07 github-actions[bot]

I'm still periodically getting a GRResult error: GRResult is not GRResult_OK, received GRResult_OutOfMemory (on alpha 4.5 it was just GRResult is not GRResult_OK). No errors in Event Viewer. The harvester stops sending partials; chia start harvester -r resets it and it starts working OK again. Occurs about every 1-3 days.

Harvester only, no other activity on the server. Alpha 4.6 (and had them on Alpha 4.5). NVidia Tesla P4, issues with drivers 528.89 and 536.25. HP Apollo 4200, Windows Server 2019, E5-2678v3, 64GB, all locally attached SAS/SATA drives. 3,434 C7 plots.

Kinda leaving this box as is for testing this issue. Have other similar systems (> 20 harvesters) with A2000 6GB GPUs and 4k-15k mainly C5 plots plus CPU-compressed plots, and no issues on them.

robcirrus avatar Jul 17 '23 18:07 robcirrus

Can you try this with the release candidate? Let us know if you still see issues. Thanks

wjblanke avatar Jul 26 '23 16:07 wjblanke

I am running rc1 and am getting the GRResult error. Debian 12, Nvidia driver 535.86.05, with a Tesla P4 as the harvester GPU.

ericgr3gory avatar Jul 26 '23 19:07 ericgr3gory

Which GRResult specifically is it showing?

harold-b avatar Jul 27 '23 00:07 harold-b

Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4 (7000 C7 plots). Have to restart the harvester every hour to keep farming.

Synergy1900 avatar Aug 07 '23 13:08 Synergy1900

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4 (7000 C7 plots). Have to restart the harvester every hour to keep farming.

Can you try rc3? We added several architectures explicitly to the harvester and plotter

wallentx avatar Aug 07 '23 14:08 wallentx

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4 (7000 C7 plots). Have to restart the harvester every hour to keep farming.

> Can you try rc3? We added several architectures explicitly to the harvester and plotter

Installed the rc3. I will evaluate for the next couple of days. Thx!

Synergy1900 avatar Aug 07 '23 17:08 Synergy1900

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4 (7000 C7 plots). Have to restart the harvester every hour to keep farming.

> Can you try rc3? We added several architectures explicitly to the harvester and plotter

> Installed the rc3. I will evaluate for the next couple of days. Thx!

Same result on RC3. Harvester stopped sending partials after the same error occurred.

Synergy1900 avatar Aug 08 '23 09:08 Synergy1900

After replacing the GPU (GTX 1070) with an RTX 2080 Ti, I stopped getting GRResult errors.

kinomexanik avatar Aug 20 '23 15:08 kinomexanik

> Same issue here with Chia 2.0.0rc1 on Ubuntu 22.04.2 and Nvidia P4 (7000 C7 plots). Have to restart the harvester every hour to keep farming.

> Can you try rc3? We added several architectures explicitly to the harvester and plotter

> Installed the rc3. I will evaluate for the next couple of days. Thx!

> Same result on RC3. Harvester stopped sending partials after the same error occurred.

Same with RC6

Synergy1900 avatar Aug 21 '23 14:08 Synergy1900

In these cases where the harvester drops out, do you see a message in dmesg about the NVIDIA driver, or a Windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRResult event was logged after GRResult is not GRResult_OK?
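
On the Linux systems, something along these lines should surface those driver faults as they happen (just a sketch; the pattern matches the NVRM: Xid line quoted earlier in this thread):

  # Follow the kernel log and surface NVIDIA Xid faults as they occur
  sudo dmesg --follow | grep -i "NVRM: Xid"
  # Or search the kernel messages for the current boot after the fact
  journalctl -k | grep -i "NVRM: Xid"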

jmhands avatar Aug 21 '23 21:08 jmhands

On my Windows Server 2019 Standard with a Tesla P4 (driver 536.25), E5-2697v3, 64GB RAM: just got the latest errors earlier today, and it logged 3 messages together for the same plot. I have seen it report multiple consecutive errors sometimes, but not usually. No log items before or after indicating other issues.

Here are some log entries from before and after the 3 errors earlier today:

2023-08-21T15:25:58.561 harvester chia.harvester.harvester: INFO 5 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.51637 s. Total 3434 plots
2023-08-21T15:26:07.546 harvester chia.harvester.harvester: INFO 11 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.59371 s. Total 3434 plots
2023-08-21T15:26:15.999 harvester chia.harvester.harvester: INFO 7 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.34372 s. Total 3434 plots
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.596 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=<G1Element ac60387671080d3fd20a7def2f2adfe21fd8eb8fa47ff1af9a22e1019358ba30cf2fb5abebe8a63c8e43b88627ed9be4>, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:26.815 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=<G1Element ac60387671080d3fd20a7def2f2adfe21fd8eb8fa47ff1af9a22e1019358ba30cf2fb5abebe8a63c8e43b88627ed9be4>, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR Exception fetching full proof for I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot. GRResult is not GRResult_OK, received GRResult_OutOfMemory
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: ERROR File: I:\BBF2P1C5\plot-k32-c05-2023-06-15-06-38-50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7.plot Plot ID: 50dfd1f68d8ba0806aa024ba004cd8eb92598285e4cd9d5cb1dbbc0220df6cd7, challenge: d5a056ba8dfe416ecd1a7fdd3aca84aeec2e08a93554df9b83e087d332b1b992, plot_info: PlotInfo(prover=<chiapos.DiskProver object at 0x00000240AC939630>, pool_public_key=None, pool_contract_puzzle_hash=<bytes32: 4c288e3a30931f7882607f8d0a9b3773322fb6cead8d292146103441f259c86b>, plot_public_key=<G1Element ac60387671080d3fd20a7def2f2adfe21fd8eb8fa47ff1af9a22e1019358ba30cf2fb5abebe8a63c8e43b88627ed9be4>, file_size=87233802240, time_modified=1686811230.8092616)
2023-08-21T15:26:27.002 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 1.09760 s. Total 3434 plots
2023-08-21T15:26:36.080 harvester chia.harvester.harvester: INFO 6 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.35933 s. Total 3434 plots
2023-08-21T15:26:44.877 harvester chia.harvester.harvester: INFO 9 plots were eligible for farming 68e12a41d5... Found 0 proofs. Time: 0.53123 s. Total 3434 plots

Nothing in the Application or System event viewer: no errors, no warnings, nothing about the NVidia drivers.

robcirrus avatar Aug 21 '23 21:08 robcirrus

Does the harvester log show any GRResult_Failed messages at any point?
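
One quick way to check, assuming the default mainnet log directory on Linux (the Windows logs live under the user profile's .chia\mainnet\log folder), is to tally which GRResult variants show up across the rotated logs:

  # Count every GRResult_* variant mentioned in the harvester debug logs
  grep -h "GRResult" ~/.chia/mainnet/log/debug.log* | grep -o "GRResult_[A-Za-z]*" | sort | uniq -c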

harold-b avatar Aug 21 '23 22:08 harold-b

> In these cases where the harvester drops out, do you see a message in dmesg about the NVIDIA driver, or a Windows hardware event for NVIDIA? Does the driver drop out and recover? Do you see anything else in the log about which GRResult event was logged after GRResult is not GRResult_OK?

Hi,

Found no messages in dmesg. Once it happens I keep getting the same GRResult is not GRResult_OK message until I restart the harvester (chia start -r harvester). There are no other messages in the debug.log. After the upgrade to RC6 it worked for about a day before the first error occurred again. Mostly it occurs randomly multiple times a day.

Regards S.

Ubuntu 22.04.2 LTS (256 GB memory), Nvidia Tesla P4, Driver Version: 535.86.10, CUDA Version: 12.2
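
Until this is properly fixed, a small watcher along these lines could do the restart automatically whenever the error shows up (a rough sketch only, assuming the default ~/.chia/mainnet/log/debug.log location and the chia CLI on the PATH; adjust as needed):

  #!/usr/bin/env bash
  # Rough workaround: restart the harvester whenever the GRResult error appears in the log.
  LOG="$HOME/.chia/mainnet/log/debug.log"
  tail -Fn0 "$LOG" | while read -r line; do
      case "$line" in
          *"GRResult is not GRResult_OK"*)
              chia start harvester -r
              sleep 300   # back off so repeated error lines don't trigger restart loops
              ;;
      esac
  done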

Synergy1900 avatar Aug 22 '23 08:08 Synergy1900

There is one theory. Check the debug.log file: is there an Invalid proof of space error? I used to have a GTX 1070 and got the GRResult error. Then I installed a 2080 Ti and the GRResult error went away, but now I periodically get the Invalid proof of space error instead. After checking those plots, I found a bad plot. My point is that the GRResult error may be caused by bad plots.

kinomexanik avatar Aug 22 '23 09:08 kinomexanik

"2023-08-15T13:00:52.627 farmer chia.farmer.farmer : ERROR Invalid proof of space: b4107dc0d19ecbc636828695d4b65b44038770ae2e575c66ecf8472dd07ed142 proof:........" plots check -g b4107dc0d19ecbc636828695d4b65b44038770ae2e575c66ecf8472dd07ed142 and it will check the plot, and tell you where its located.

kinomexanik avatar Aug 22 '23 09:08 kinomexanik

I have already checked all my plots (multiple times); they are OK.

Synergy1900 avatar Aug 22 '23 11:08 Synergy1900

> Does the harvester log show any GRResult_Failed messages at any point?

I'm not finding any of these in my logs.

Mine will typically go days before it gets the GRResult is not GRResult_OK, received GRResult_OutOfMemory error.

robcirrus avatar Aug 22 '23 15:08 robcirrus

Same issue for me. Happens every 1-2 days :(

Linux Mint 21.2 (full node), i7 / 32 GB, NVIDIA GeForce GTX 1050 Ti 4 GB, Driver Version: 535.86.10, CUDA Version: 12.2, Plots: 226 TB, Compression: C7, Chia Version: 2.0.0

4ntibala avatar Aug 29 '23 08:08 4ntibala

I'm still getting the GRResult is not GRResult_OK, received GRResult_OutOfMemory error on the 2.0 release, which I installed several days ago. Getting it on the 2 almost identical systems with the Tesla P4, driver 536.25: Windows Server 2019, 2x E5-2650v4, 32GB, Tesla P4. Have 8 other similar systems with A2000 6GB GPUs and 23 with CPU-compression harvesting; none of them have ever hit the errors. About 70% C5 and 30% C7 plots.

robcirrus avatar Aug 29 '23 13:08 robcirrus

> Does the harvester log show any GRResult_Failed messages at any point?

~~Tested the GPU with compression and noticed that the GRResult is not GRResult_OK, received GRResult_OutOfMemory error appeared after the contract difficulty was switched. At the same time, the harvester and farmer continued to work with uncompressed plots. After restarting only the harvester, and with the contract difficulty unchanged, everything has worked for more than a day; before the difficulty switch it also farmed for about three days.~~

Ubuntu 22.10, P106-90 6GB, Driver Version: 535.86.10, CUDA Version: 12.2, Compression: C6, Chia Version: 2.0.0

imba-pericia avatar Aug 31 '23 10:08 imba-pericia