
thread failed with: position out of bounds (phase 3)

therealflinchy opened this issue 4 years ago · 18 comments

I get this at some point in phase 3, every single time without fail. I have not had a plot succeed yet using this plotter.

I've spoken to a few other people with the same issue. Not everyone gets it every time, but it happens frequently.

[Screenshot from 2021-06-08 10-58-44]

therealflinchy avatar Jun 07 '21 23:06 therealflinchy

Same issue here

kastermight avatar Jun 08 '21 03:06 kastermight

Hmm, that's a sanity check: basically, an entry in one table is pointing to an entry that doesn't exist in the other...

I will try to figure it out later. But I've done >40 plots without an issue.
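For context, here is a minimal sketch of what that kind of check amounts to (illustrative names only, not the actual chia-plotter code): phase 3 reads a table whose entries point back into the previous table by position, and the check fires when a position points past the number of entries that were actually written for that previous table.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative only: each "right" entry stores a position into the previous
// ("left") table. If that position is past the end of what was actually
// written, an entry was lost somewhere earlier in the pipeline.
static uint64_t lookup_left(const std::vector<uint64_t>& left_table, uint64_t position)
{
    if (position >= left_table.size()) {
        throw std::runtime_error("position out of bounds (" + std::to_string(position) + ")");
    }
    return left_table[position];
}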

madMAx43v3r avatar Jun 08 '21 06:06 madMAx43v3r

Try latest master, at least we'll know the position then.

madMAx43v3r avatar Jun 08 '21 06:06 madMAx43v3r

I just added even more info. If it happens again, please show the logs of phase 3 (like the OP did), so we can compare to the number of entries written in the previous table.

madMAx43v3r avatar Jun 08 '21 07:06 madMAx43v3r

I've had something like that with the original plotter and unstable RAM settings (set by the default XMP profile in the BIOS). I found the fastest stable frequency/timings (one memtest pass with no errors), and everything was fine after that.

hidoba avatar Jun 08 '21 09:06 hidoba

> I just added even more info. If it happens again, please show the logs of phase 3 (like the OP did), so we can compare to the number of entries written in the previous table.

Synced about an hour ago, running 16 threads (on a 16-core/32-thread system):

Plot Name: plot-k32-2021-06-08-12-09-d80c9c350bf8b16f6443db24b83fe662622add8ce36d6c99091214d073d5ed9f
[P1] Table 1 took 12.3721 sec
[P1] Table 2 took 110.631 sec, found 4294943854 matches
[P1] Table 3 took 118.607 sec, found 4294934865 matches
[P1] Table 4 took 142.453 sec, found 4294779041 matches
[P1] Table 5 took 139.674 sec, found 4294670270 matches
[P1] Table 6 took 137.278 sec, found 4294282787 matches
[P1] Table 7 took 109.983 sec, found 4293601671 matches
Phase 1 took 771.011 sec
[P2] max_table_size = 4294967296
[P2] Table 7 scan took 17.4629 sec
[P2] Table 7 rewrite took 47.6267 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 35.5224 sec
[P2] Table 6 rewrite took 72.3529 sec, dropped 581360985 entries (13.538 %)
[P2] Table 5 scan took 33.1653 sec
[P2] Table 5 rewrite took 70.7195 sec, dropped 762093332 entries (17.7451 %)
[P2] Table 4 scan took 31.3774 sec
[P2] Table 4 rewrite took 68.5294 sec, dropped 828879275 entries (19.2997 %)
[P2] Table 3 scan took 31.671 sec
[P2] Table 3 rewrite took 68.6792 sec, dropped 855113261 entries (19.9098 %)
[P2] Table 2 scan took 31.6745 sec
[P2] Table 2 rewrite took 68.405 sec, dropped 865598811 entries (20.1539 %)
Phase 2 took 595.253 sec
Wrote plot header with 268 bytes
[P3-1] Table 2 took 77.8985 sec, wrote 3429345043 right entries
[P3-2] Table 2 took 86.3875 sec, wrote 3429345043 left entries, 3429345043 final
[P3-1] Table 3 took 59.8152 sec, wrote 3439821604 right entries
[P3-2] Table 3 took 89.0155 sec, wrote 3439821604 left entries, 3439821604 final
[P3-1] Table 4 took 60.3178 sec, wrote 3465899766 right entries
[P3-2] Table 4 took 88.3687 sec, wrote 3465899766 left entries, 3465899766 final
[P3-1] Table 5 took 58.7819 sec, wrote 3532572842 right entries
[P3-2] Table 5 took 94.2765 sec, wrote 3532572842 left entries, 3532572842 final
terminate called after throwing an instance of 'std::runtime_error'
  what():  thread failed with: position out of bounds (3532124949 + max(447658,447931))
Aborted (core dumped)

xydreen avatar Jun 08 '21 12:06 xydreen

Interesting. We can see that 3532572842 left entries were written in the previous stage 2, and now it's trying to access index 3532124949 + 447931 = 3532572880, which is 38 entries beyond the end of table 5.

Looking further back at phase 2, we can see that 762093332 entries were dropped from the initial 4294670270 matches of table 5 in phase 1. Subtracting those, there should be 3532576938 entries in table 5 after phase 2; however, in phase 3 stage 1 only 3532572842 entries were written for table 5, which is exactly 4096 less than expected. 4096 is the size of the write cache.

Either the missing 4096 entries were lost when writing table 5 in phase 2, or they were lost when writing table 5 in phase 3 stage 2...
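To spell out the arithmetic with the values from the log above (variable names are illustrative only):

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t p1_matches  = 4294670270ULL;             // table 5 matches found in phase 1
    const uint64_t p2_dropped  = 762093332ULL;              // table 5 entries dropped in phase 2
    const uint64_t p3_written  = 3532572842ULL;             // table 5 entries written in phase 3 stage 1
    const uint64_t fault_index = 3532124949ULL + 447931ULL; // offset + position from the error message

    const uint64_t expected = p1_matches - p2_dropped;      // 3532576938
    std::printf("expected entries: %llu\n", (unsigned long long) expected);
    std::printf("missing entries:  %llu\n", (unsigned long long) (expected - p3_written));    // 4096
    std::printf("past end by:      %llu\n", (unsigned long long) (fault_index - p3_written)); // 38
    return 0;
}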

madMAx43v3r avatar Jun 08 '21 13:06 madMAx43v3r

Dang, I found the issue, that's a serious mishap. Fix pending...

madMAx43v3r avatar Jun 08 '21 13:06 madMAx43v3r

> Dang, I found the issue, that's a serious mishap. Fix pending...

I blame math. Math always wins :(.

Interestingly, I started this back up with 4 threads instead of 16 to see if the number of threads might for some reason be confusing something, and all throughout the first phase I see:

[P1] Table 1 took 29.6496 sec
[P1] Table 2 took 202.767 sec, found 4294995349 matches
[P1] Lost 27524 matches due to 32-bit overflow.
[P1] Table 3 took 214.641 sec, found 4295103442 matches
[P1] Lost 136634 matches due to 32-bit overflow.
[P1] Table 4 took 269.252 sec, found 4295182971 matches
[P1] Lost 214699 matches due to 32-bit overflow.
[P1] Table 5 took 265.492 sec, found 4295291374 matches
[P1] Lost 325441 matches due to 32-bit overflow.
[P1] Table 6 took 252.03 sec, found 4295306499 matches
[P1] Lost 336948 matches due to 32-bit overflow.
[P1] Table 7 took 200.589 sec, found 4295379167 matches
Phase 1 took 1434.44 sec

Although this was mentioned as "fine" in a previous thread, it's interesting to see it here and not in the previous 16-thread run. I'm sure there's a reason for it.

Anyway, thanks a bunch!

xydreen avatar Jun 08 '21 13:06 xydreen

The issue is fixed in latest master. The same bug could cause matches to get lost in phase 1 as well, but in phase 3 any entries lost in phase 2 lead to the error discussed here.

I suppose this issue only happened on certain OSes or hardware, depending on the implementation of fwrite(), i.e. whether it is internally thread-safe or not. And of course it becomes more likely with increasing thread count.

The bug was missing synchronization when writing sort buckets: if two threads were writing at the same time, the file offset could be advanced in parallel, so both would write to the same location and data would be lost.

EDIT: The same applies to incrementing the num_entries counter, which was also not synchronized before. I missed that during a refactor at some point.
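For illustration, a minimal sketch of the kind of synchronization that avoids this race (hypothetical names, not the actual chia-plotter code): both the fwrite() that advances the shared file offset and the num_entries update have to happen inside the same critical section.

#include <cstdint>
#include <cstdio>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical bucket writer: many threads flush their local write caches
// into the same file.
struct bucket_writer_t {
    FILE* file = nullptr;
    uint64_t num_entries = 0;   // total entries written so far
    std::mutex mutex;           // guards the file offset and the counter together

    // Each thread calls this to flush a full write cache (e.g. 4096 entries).
    void flush(const std::vector<uint8_t>& cache, size_t entry_size) {
        std::lock_guard<std::mutex> lock(mutex);
        // Without the lock, two threads can fwrite() at the same offset,
        // silently losing a whole cache worth of entries, and the counter
        // increment can be lost as well.
        const size_t count = cache.size() / entry_size;
        if (std::fwrite(cache.data(), entry_size, count, file) != count) {
            throw std::runtime_error("fwrite() failed");
        }
        num_entries += count;
    }
};

An alternative would be to reserve each write's offset with an atomic fetch-add and write at that explicit offset, as long as the entry counter is updated with the same care.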

madMAx43v3r avatar Jun 08 '21 13:06 madMAx43v3r

To anyone who had this issue at some point, check all your plots with ProofOfSpace check -f *.plot.

madMAx43v3r avatar Jun 08 '21 13:06 madMAx43v3r

@madMAx43v3r I tried running it on my machine because I was curious, but it failed, and I'm too dumb to fix it myself, so if you have any ideas let me know xD

./tests/cpp/test.cpp:88: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGSEGV - Segmentation violation signal

===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 6 | 5 passed | 1 failed

Segmentation fault (core dumped)

BitSec01 avatar Jun 08 '21 15:06 BitSec01

git clone https://github.com/Chia-Network/chiapos.git
cd chiapos
mkdir build
cd build
cmake ..
make -j8
./ProofOfSpace check -f /path/to/plot-k32-*.plot

madMAx43v3r avatar Jun 08 '21 15:06 madMAx43v3r

sweet, thanks, gets all the way through a plot now!

therealflinchy avatar Jun 08 '21 15:06 therealflinchy

> The issue is fixed in latest master. The same bug could cause matches to get lost in phase 1 as well, but in phase 3 any entries lost in phase 2 lead to the error discussed here.
>
> I suppose this issue only happened on certain OSes or hardware, depending on the implementation of fwrite(), i.e. whether it is internally thread-safe or not. And of course it becomes more likely with increasing thread count.
>
> The bug was missing synchronization when writing sort buckets: if two threads were writing at the same time, the file offset could be advanced in parallel, so both would write to the same location and data would be lost.
>
> EDIT: The same applies to incrementing the num_entries counter, which was also not synchronized before. I missed that during a refactor at some point.

Hey, just a couple of updates. I have had zero issues generating plots with your newly pushed master branch.

NEW 16 NVME (3x 3.84 TB RAID 0):

Phase 1 took 756.703 sec
Phase 2 took 590.655 sec
Phase 3 took 1068.7 sec, wrote 21875149723 entries to final plot
Phase 4 took 84.4367 sec, final plot size is 108823317007 bytes
Total plot creation time was 2500.56 sec

NEW 16 RAMD (RAM disk):

Phase 1 took 745.945 sec
Phase 2 took 606.673 sec
Phase 3 took 1063.5 sec, wrote 21877310571 entries to final plot
Phase 4 took 84.5115 sec, final plot size is 108836168451 bytes
Total plot creation time was 2500.69 sec

My poor RAM LOL!

Anyway, thanks for pushing that fix!

xydreen avatar Jun 08 '21 16:06 xydreen

I'm glad it works now. Thanks for helping to fix it.

Those times are seriously close, are you sure you used the RAM disk?

madMAx43v3r avatar Jun 08 '21 16:06 madMAx43v3r

> I'm glad it works now. Thanks for helping to fix it.
>
> Those times are seriously close, are you sure you used the RAM disk?

Thank you!

And oh yeah, this RAM is slow as hell. This is an enterprise server, so the RAM is sadly slower than 3x NVMe drives in RAID 0. Sadface.

xydreen avatar Jun 08 '21 16:06 xydreen

I'm still getting this today. I'm running the latest code on master, using -r 12.

System

Ubuntu 20.04.1, kernel 5.11.0-27-generic
Ryzen 5 5600X processor
16 GB RAM
1 TB FireCuda M.2 drive (used for temp space)

Any ideas?

Edit: I realized I may have posted this in the wrong issue. I'm getting this in Phase 1 Table 3 (generally).

hollyschilling avatar Aug 29 '21 23:08 hollyschilling