
Stack-overflow during initial re-commit after sync is done

Open madMAx43v3r opened this issue 3 years ago • 12 comments

When plotting larger plots, like 18700G (i.e. 160 plots), the farmer crashes with a stack overflow during the initial re-commit after the sync is done.

At this point you see these messages, one for each plot:

2022-06-15T19:10:46.002494Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002519Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002588Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.002640Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.003009Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004324Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004353Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004395Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004397Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0
2022-06-15T19:10:46.004455Z  INFO subspace_farmer::farming: Salt updated, recommitting in background new_salt=79f25142c329a0d0

After that it continues to archive segments, until the crash:

2022-06-15T18:33:16.314271Z  INFO subspace_farmer::archiving: Plotted segment segment_index=109991
2022-06-15T18:35:29.977825Z  INFO subspace_farmer::archiving: Plotted segment segment_index=109992
2022-06-15T18:36:28.784317Z  INFO subspace_farmer::archiving: Plotted segment segment_index=109993
2022-06-15T18:37:23.277314Z  INFO subspace_farmer::archiving: Plotted segment segment_index=109994
2022-06-15T18:38:01.182120Z  INFO subspace_farmer::archiving: Plotted segment segment_index=109995

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow

Commitment DBs for salt 79f25142c329a0d0 are around 350M in size after the crash, compared to 450M for a finished process, so none of the re-commits actually finished.

madMAx43v3r avatar Jun 15 '22 19:06 madMAx43v3r

Initial idea: large futures built via FuturesUnordered produce one big future that is stack-allocated. It should likely be heap-allocated instead, and all relevant places need to be checked.

nazar-pc avatar Jun 15 '22 19:06 nazar-pc

I did notice the node fell behind due to an IO bottleneck. I did not see the "waiting for farmer to accept ..." messages, though.

madMAx43v3r avatar Jun 15 '22 20:06 madMAx43v3r

Can you run it with gdb to catch where it crashes exactly?

nazar-pc avatar Jun 16 '22 08:06 nazar-pc

I limited the maximum number of parallel re-commit threads to 32 and it went through without crashing. The node also didn't fall behind during the rehash this time. So my guess is that the crash is related to the node falling behind due to the IO bottleneck; there were a lot of "Creating data providers took longer than we had for the slot" messages.

madMAx43v3r avatar Jun 16 '22 11:06 madMAx43v3r

How exactly did you limit the number of threads?

nazar-pc avatar Jun 16 '22 11:06 nazar-pc

A dirty hack: a global counter barrier in the worker threads.
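A counter barrier like that can be sketched with std sync primitives (all names here are hypothetical; this is not the actual patch, just the general shape, assuming a fixed cap of 32):

```rust
use std::sync::{Condvar, Mutex};

// Crude global cap on concurrent re-commit workers.
const MAX_PARALLEL_RECOMMITS: usize = 32;

static ACTIVE: Mutex<usize> = Mutex::new(0);
static CHANGED: Condvar = Condvar::new();

// Called at the start of each worker; blocks until a slot is free.
fn acquire_slot() {
    let mut active = ACTIVE.lock().unwrap();
    while *active >= MAX_PARALLEL_RECOMMITS {
        active = CHANGED.wait(active).unwrap();
    }
    *active += 1;
}

// Called when a worker finishes; wakes one waiting worker.
fn release_slot() {
    let mut active = ACTIVE.lock().unwrap();
    *active -= 1;
    CHANGED.notify_one();
}

fn main() {
    // Spawn more workers than slots; each takes a slot, "works", releases.
    let handles: Vec<_> = (0..64)
        .map(|_| {
            std::thread::spawn(|| {
                acquire_slot();
                // ... re-commit work would happen here ...
                release_slot();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // All slots returned once every worker has finished.
    assert_eq!(*ACTIVE.lock().unwrap(), 0);
}
```

A `Semaphore` (e.g. `tokio::sync::Semaphore` in async code) expresses the same cap more directly; the condvar version just mirrors the "global counter" description above.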

madMAx43v3r avatar Jun 16 '22 12:06 madMAx43v3r

I tried to reproduce on a 5900X with modified code that creates 160 commitments for the same plot, and it succeeded in ~5200 seconds (NVMe SSD).

So far not really sure how to reproduce :confused:

nazar-pc avatar Jun 16 '22 12:06 nazar-pc

The SSD is probably fast enough that the node doesn't fall behind. I tested on a 24-disk (10k RPM, 2.5-inch) RAID6 array, with the node and farmer data both on that array.

madMAx43v3r avatar Jun 16 '22 12:06 madMAx43v3r

Maybe just test a normal setup and hammer the drive with some benchmarks, so the node falls behind without waiting on the farmer.

madMAx43v3r avatar Jun 16 '22 12:06 madMAx43v3r

I did such things in the past and never had a crash like this with smaller plot sizes.

nazar-pc avatar Jun 16 '22 12:06 nazar-pc

Then it may depend on the large plot size after all.

madMAx43v3r avatar Jun 16 '22 12:06 madMAx43v3r

Btw, it happened 3 times out of 3 for me (on the same machine), so it's not random.

madMAx43v3r avatar Jun 16 '22 12:06 madMAx43v3r

I think this should no longer be the case with Gemini 2a (we have applied some concurrency restrictions on disk I/O and other tweaks); if it still happens, we'd need more details to resume work on this.

nazar-pc avatar Sep 22 '22 14:09 nazar-pc