Stack overflow during initial re-commit after sync is done
When plotting larger plot sizes, like 18700G (i.e. 160 plots), the farmer crashes with a stack overflow during the initial re-commit after the sync is done.
At this point you see these messages, one for each plot:
2022-06-15T19:10:46.002494Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002519Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002588Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002640Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.003009Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004324Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004353Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004395Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004397Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004455Z INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
After that it continues to archive segments, until the crash:
2022-06-15T18:33:16.314271Z INFO subspace_farmer::archiving: Plotted segment segment_index=109991
2022-06-15T18:35:29.977825Z INFO subspace_farmer::archiving: Plotted segment segment_index=109992
2022-06-15T18:36:28.784317Z INFO subspace_farmer::archiving: Plotted segment segment_index=109993
2022-06-15T18:37:23.277314Z INFO subspace_farmer::archiving: Plotted segment segment_index=109994
2022-06-15T18:38:01.182120Z INFO subspace_farmer::archiving: Plotted segment segment_index=109995
thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
The commitment DBs for salt 79f25142c329a0d0 are around 350M in size after the crash, compared to 450M for a finished run, so none of the re-commits actually finished.
The initial idea is that this is caused by big futures: FuturesUnordered combines its futures into one large future that ends up stack-allocated. These should likely be heap-allocated instead, and all relevant places need to be checked.
I did notice the node fell behind due to an IO bottleneck. I did not see the "waiting for farmer to accept ..." messages, though.
Can you run it with gdb to catch where it crashes exactly?
I limited the maximum number of parallel re-commit threads to 32 and it went through without crashing. The node also didn't fall behind during the rehash this time. So my guess is that the crash is related to the node falling behind due to the IO bottleneck; there were a lot of "Creating data providers took longer than we had for the slot" messages.
How did you limit number of threads exactly?
Dirty hack: a global counter barrier in the worker threads.
I tried to reproduce on 5900X with modified code that creates 160 commitments for the same plot and it succeeded in ~5200 seconds (NVMe SSD).
So far not really sure how to reproduce :confused:
The SSD is probably fast enough that the node doesn't fall behind. I tested on a 24-disk (10k RPM, 2.5 inch) RAID6 array, with both the node and farmer data on that array.
Maybe just test a normal setup and hammer the drive with some benchmarks, so the node falls behind but doesn't wait on the farmer.
I did such things in the past and never had a crash like this with smaller plot sizes.
Then it may depend on the large plot size after all.
Btw, it happened 3 times out of 3 for me (on the same machine), so it's not random.
I think this should no longer be the case with Gemini 2a (we have applied some concurrency restrictions for disk I/O and other tweaks), if it still is, we'd need more details to resume work on this.