
Is this still valid?

Open jstarcher opened this issue 6 years ago • 51 comments

I'm wondering if anyone can confirm whether this is still a valid test. I downloaded 17.04 and ran the tests as described, and can't make it more than 200 seconds running all stock settings. I'm on a week 43 Ryzen 1700 and I can't seem to make anything else fail: 8hrs of Prime95, 8hrs of Memtest86, etc.

I've played with the DRAM voltage as well as SoC voltage and it didn't have any effect. One thing I noticed was that my integrated wifi adapter would throw a message in syslog, and as soon as that happened the kill-ryzen script would fail. I disabled the wifi adapter in the BIOS and that allowed it to run longer, which makes me wonder if this script fails on false positives?

jstarcher avatar Feb 21 '18 15:02 jstarcher

There is no official information on which CPUs are affected (or not affected). Your description here does fit the ryzen segfault bug. Prime95 and Memtest86 are not (as) sensitive to the bug as this workload. If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected. If all processes fail within a short period it may be due to another problem.
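The rule of thumb above can be written down as a small Python sketch. It assumes log lines of the form `[loop-N] TIME TO FAIL: X s`, as they appear in the output posted in this thread; the function name, the returned labels, and the reading of "a few" as "fewer than all loops" are illustrative assumptions, not part of the test script.

```python
import re

# Matches lines such as "[loop-12] TIME TO FAIL: 147 s" (format as seen in this thread).
FAIL_RE = re.compile(r"\[loop-(\d+)\]\s+TIME TO FAIL:\s+(\d+)\s*s")


def classify_run(log_text, total_loops, quick_seconds=300):
    """Apply the rule of thumb: a few quick failures point at the CPU,
    every loop failing quickly points at some other problem.

    quick_seconds (5 minutes) comes from the comment above; the labels
    returned here are illustrative, not an official diagnosis.
    """
    failures = {int(loop): int(secs) for loop, secs in FAIL_RE.findall(log_text)}
    quick = [loop for loop, secs in failures.items() if secs < quick_seconds]
    if not failures:
        return "no failures seen"
    if len(failures) >= total_loops and len(quick) == len(failures):
        return "all loops failed quickly: suspect another problem"
    if quick:
        return "some loops failed quickly: CPU is probably affected"
    return "failures seen, but not quickly: inconclusive"
```

Feeding it the captured output of a run together with the number of loops that were started gives a first hint; the build logs still need a look before drawing conclusions.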

You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.

suaefar avatar Feb 21 '18 18:02 suaefar

Welp I think this script confirmed it. Another one for RMA. After much toying around (aka wasting valuable time) I tried the suggestion from #23 about disabling OpCode cache. As soon as I did that the kill ryzen script ran for about 20 minutes without crashing - by far the longest it has gone yet. I had to stop the script as I needed to get back on the machine but this proved that my CPU is affected as well.

YD1700BBM88AE UA 1743SUT

Very disappointing that AMD still hasn't gotten this under control. Now Newegg is giving me hassle about replacing it too. Ugh!

Anyway, thanks for the script and the response to this issue!

jstarcher avatar Feb 21 '18 20:02 jstarcher

@suaefar I just installed a new replacement I purchased and hit this AGAIN. The new CPU is a week 33: 1733PGS. Testing was the same - a fresh 17.04 flash drive.

sudo dmidecode -t memory | grep -i -E "(rank|speed|part)" | grep -v -i unknown
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
Speed: 3200 MHz
Part Number: F4-3200C14-8GFX
Rank: 1
Configured Clock Speed: 1600 MHz
uname -a
Linux ubuntu 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/sys/kernel/randomize_va_space
2
/ /mnt/ramdisk/workdir /mnt/ramdisk/workdir
Using 16 parallel processes
[KERN] -- Logs begin at Fri 2018-02-23 15:36:39 EST. --
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: snapd.refresh.timer: Adding 3h 21min 14.449848s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: apt-daily.timer: Adding 2h 14min 40.656684s random time.
[KERN] Feb 23 15:36:55 ubuntu systemd[1]: motd-news.timer: Adding 51min 54.931579s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 1h 54min 27.129499s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: snapd.refresh.timer: Adding 44min 49.245281s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: apt-daily.timer: Adding 4h 55min 48.521122s random time.
[KERN] Feb 23 15:37:05 ubuntu systemd[1]: motd-news.timer: Adding 18min 53.032133s random time.
[KERN] Feb 23 15:39:10 ubuntu kernel: zram: Added device: zram0
[KERN] Feb 23 15:39:10 ubuntu kernel: zram0: detected capacity change from 0 to 68719476736
[KERN] Feb 23 15:39:10 ubuntu kernel: EXT4-fs (zram0): mounted filesystem with ordered data mode. Opts: discard
[loop-0] Fri Feb 23 15:39:46 EST 2018 start 0
[loop-1] Fri Feb 23 15:39:47 EST 2018 start 0
[loop-2] Fri Feb 23 15:39:48 EST 2018 start 0
[loop-3] Fri Feb 23 15:39:49 EST 2018 start 0
[loop-4] Fri Feb 23 15:39:50 EST 2018 start 0
[loop-5] Fri Feb 23 15:39:51 EST 2018 start 0
[loop-6] Fri Feb 23 15:39:52 EST 2018 start 0
[loop-7] Fri Feb 23 15:39:53 EST 2018 start 0
[loop-8] Fri Feb 23 15:39:54 EST 2018 start 0
[loop-9] Fri Feb 23 15:39:55 EST 2018 start 0
[loop-10] Fri Feb 23 15:39:56 EST 2018 start 0
[loop-11] Fri Feb 23 15:39:57 EST 2018 start 0
[loop-12] Fri Feb 23 15:39:58 EST 2018 start 0
[loop-13] Fri Feb 23 15:39:59 EST 2018 start 0
[loop-14] Fri Feb 23 15:40:00 EST 2018 start 0
[loop-15] Fri Feb 23 15:40:01 EST 2018 start 0
[loop-12] Fri Feb 23 15:42:13 EST 2018 build failed
[loop-12] TIME TO FAIL: 147 s
[KERN] Feb 23 15:42:13 ubuntu kernel: traps: bash[32728] general protection ip:445b20 sp:7fff1ce38448 error:0
[KERN] Feb 23 15:42:13 ubuntu kernel: in bash[400000+100000]
[KERN] Feb 23 15:42:26 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[loop-8] Fri Feb 23 15:43:32 EST 2018 build failed
[loop-8] TIME TO FAIL: 226 s
[KERN] Feb 23 15:43:32 ubuntu kernel: bash[21958]: segfault at d ip 0000000000431f2e sp 00007ffc28f648c0 error 4 in bash[400000+100000]
[KERN] Feb 23 15:47:41 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:52:57 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready
[KERN] Feb 23 15:58:12 ubuntu kernel: IPv6: ADDRCONF(NETDEV_UP): wlp9s0: link is not ready

Checking build-8 log I see: /bin/bash ../libtool --tag=CC --mode=link gcc -DNO_ASM -g -version-info 5:4:1 -static-libstdc++ -static-libgcc -o libmpfr.la -rpath /usr/local/lib exceptions.lo extract.lo uceil_exp2.lo uceil_log2.lo ufloor_log2.lo add.lo add1.lo add_ui.lo agm.lo clear.lo cmp.lo cmp_abs.lo cmp_si.lo cmp_ui.lo comparisons.lo div_2exp.lo div_2si.lo div_2ui.lo div.lo div_ui.lo dump.lo eq.lo exp10.lo exp2.lo exp3.lo exp.lo frac.lo frexp.lo get_d.lo get_exp.lo get_str.lo init.lo inp_str.lo isinteger.lo isinf.lo isnan.lo isnum.lo const_log2.lo log.lo modf.lo mul_2exp.lo mul_2si.lo mul_2ui.lo mul.lo mul_ui.lo neg.lo next.lo out_str.lo printf.lo vasprintf.lo const_pi.lo pow.lo pow_si.lo pow_ui.lo print_raw.lo print_rnd_mode.lo reldiff.lo round_prec.lo set.lo setmax.lo setmin.lo set_d.lo set_dfl_prec.lo set_exp.lo set_rnd.lo set_f.lo set_prc_raw.lo set_prec.lo set_q.lo set_si.lo set_str.lo set_str_raw.lo set_ui.lo set_z.lo sqrt.lo sqrt_ui.lo sub.lo sub1.lo sub_ui.lo rint.lo ui_div.lo ui_sub.lo urandom.lo urandomb.lo get_z_exp.lo swap.lo factorial.lo cosh.lo sinh.lo tanh.lo sinh_cosh.lo acosh.lo asinh.lo atanh.lo atan.lo cmp2.lo exp_2.lo asin.lo const_euler.lo cos.lo sin.lo tan.lo fma.lo fms.lo hypot.lo log1p.lo expm1.lo log2.lo log10.lo ui_pow.lo ui_pow_ui.lo minmax.lo dim.lo signbit.lo copysign.lo setsign.lo gmp_op.lo init2.lo acos.lo sin_cos.lo set_nan.lo set_inf.lo set_zero.lo powerof2.lo gamma.lo set_ld.lo get_ld.lo cbrt.lo volatile.lo fits_sshort.lo fits_sint.lo fits_slong.lo fits_ushort.lo fits_uint.lo fits_ulong.lo fits_uintmax.lo fits_intmax.lo get_si.lo get_ui.lo zeta.lo cmp_d.lo erf.lo inits.lo inits2.lo clears.lo sgn.lo check.lo sub1sp.lo version.lo mpn_exp.lo mpfr-gmp.lo mp_clz_tab.lo sum.lo add1sp.lo free_cache.lo si_op.lo cmp_ld.lo set_ui_2exp.lo set_si_2exp.lo set_uj.lo set_sj.lo get_sj.lo get_uj.lo get_z.lo iszero.lo cache.lo sqr.lo int_ceil_log2.lo isqrt.lo strtofr.lo pow_z.lo logging.lo mulders.lo get_f.lo round_p.lo erfc.lo atan2.lo 
subnormal.lo const_catalan.lo root.lo sec.lo csc.lo cot.lo eint.lo sech.lo csch.lo coth.lo round_near_x.lo constant.lo abort_prec_max.lo stack_interface.lo lngamma.lo zeta_ui.lo set_d64.lo get_d64.lo jn.lo yn.lo rem1.lo get_patches.lo add_d.lo sub_d.lo d_sub.lo mul_d.lo div_d.lo d_div.lo li2.lo rec_sqrt.lo min_prec.lo buildopt.lo digamma.lo bernoulli.lo isregular.lo set_flt.lo get_flt.lo scale2.lo set_z_exp.lo ai.lo gammaonethird.lo grandom.lo -lgmp ßßßßßßßßßßß^K
Makefile:518: recipe for target 'libmpfr.la' failed
make[5]: *** [libmpfr.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:446: recipe for target 'all' failed
make[4]: *** [all] Error 2
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr/src'
Makefile:468: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8/mpfr'
Makefile:6475: recipe for target 'all-stage1-mpfr' failed
make[2]: *** [all-stage1-mpfr] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-8'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2

and loop-12:
Makefile:864: recipe for target 'libgmp.la' failed
make[5]: *** [libgmp.la] Segmentation fault (core dumped)
make[5]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:954: recipe for target 'all-recursive' failed
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:773: recipe for target 'all' failed
make[3]: *** [all] Error 2
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12/gmp'
Makefile:5521: recipe for target 'all-stage1-gmp' failed
make[2]: *** [all-stage1-gmp] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:27079: recipe for target 'stage1-bubble' failed
make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-12'
Makefile:941: recipe for target 'all' failed
make: *** [all] Error 2
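Build logs like these are long, and the line that matters is the one where make reports the segfault. A minimal Python sketch for pulling those lines out, assuming the GNU make message format shown in the logs above (the function name is made up for illustration):

```python
import re

# Matches make errors such as:
#   make[5]: *** [libgmp.la] Segmentation fault (core dumped)
SEGV_RE = re.compile(r"make\[(\d+)\]: \*\*\* \[([^\]]+)\] Segmentation fault")


def segfaulted_targets(log_text):
    """Return the make targets whose recipe died with a segmentation fault."""
    return [target for _level, target in SEGV_RE.findall(log_text)]
```

Run over the two logs above, this would report `libmpfr.la` for loop-8 and `libgmp.la` for loop-12. Targets that differ from loop to loop fit the picture of random crashes better than a single reproducible build error would.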

Do you think I'm really that unlucky to get two post week 25 chips with the bug? I've already replaced the motherboard as well.

Thanks!

jstarcher avatar Feb 23 '18 21:02 jstarcher

You should try to run the memory at stock settings, just to be sure. Unstable memory also can result in segfaults.

m-r-s avatar Feb 23 '18 21:02 m-r-s

I’ve tried JEDEC SPD, XMP, and everything in between with the same results. Also tried bumping DRAM and SOC above the XMP values of 1.35v and 1.1v respectively, to no avail.

I just tried something new - popped out one stick of ram so I’m down to 8gb. The test ran for 10 minutes before running out of memory. That’s 8 minutes longer than ever before so maybe it’s ram after all or maybe the bug affects the memory controller?

Any recommendations on params to run with 8gb? 2 loops and 2 threads?

jstarcher avatar Feb 24 '18 02:02 jstarcher

I'm in the same boat as you. My first R7 1700 was 1734 and had the bug show up. Went through the RMA process with NewEgg and just received... 1734PGS... same week, showing segfaults so far... let's see if it segfaults at 1.35 volts...

disturbednny avatar Feb 24 '18 05:02 disturbednny

Then you both probably got faulty CPUs again. We don't know what exactly is wrong, possibly something memory-related... the controller, or cache coherency. We don't know how to distinguish good from bad ones (AMD did not tell us, maybe even they don't know). The only tool we have is to run workloads on these CPUs which are likely to trigger the behavior.

I cannot understand how AMD gets away with this. There must be thousands of faulty CPUs around, and they still sell them :(

I am deeply disappointed.

With 8 GB of RAM, better go for 3 loops with 5 threads, or 2 loops with 8 threads.

Good luck!

m-r-s avatar Feb 24 '18 09:02 m-r-s

Thanks! Yet a third CPU will be here tomorrow. I’m thinking it might be time to look at getting a class action suit together to get them to talk. I used to love AMD but this is beyond ridiculous!

jstarcher avatar Feb 24 '18 14:02 jstarcher

Where do you live that NewEgg RMAs so fast? Or did you buy from another source? What mobo and RAM do you have? I'm even trying older BIOS versions to see if that might help... it hasn't so far. I use this PC for work and have lost hours to this; I need a place that will do an advanced replacement... or suck it up and get a Zen+ processor when they come out. Hopefully it's not in that microarchitecture as well...

disturbednny avatar Feb 24 '18 14:02 disturbednny

Newegg refused to exchange it because it’s past 30 days. I was fighting some other issues and decided to RMA the motherboard first. By the time I switched out the motherboard, Newegg had closed my RMA for the CPU. They also pissed me off because they wouldn’t take the motherboard back because I had sent the UPC in for the rebate.

This time I ordered from Amazon. I also use my PC for work as I work from home and couldn’t afford downtime. Newegg won’t do advanced replacement or returns on CPUs btw. I ended up filing a claim with my credit card for the return protection because of this mess.

I finally got back a response from AMD days later and they approved an RMA no questions asked and gave me a 2 day label.

Amazon was a one-click exchange and they do advanced replacement, so hopefully the one coming tomorrow is not bugged. If it is, I’ll go through the AMD RMA.

One way or another I’m getting to the bottom of this. I compile code on Linux for work and need this stable!

I’ll send that back an

jstarcher avatar Feb 24 '18 18:02 jstarcher

Motherboard is an ASRock X370 Taichi. I’ve tried different bios versions without luck. For memory I’ve got 2x8gb GSkill FlareX.

jstarcher avatar Feb 24 '18 18:02 jstarcher

I have the same ram as you but have the aorus gaming k7. I think I'm going to wait until zen+ comes out to rma it, then sell it because i seem to have bad luck. At least this one is 100% stable with my RAMs xmp profile according to stressapptest and aida64

disturbednny avatar Feb 25 '18 15:02 disturbednny

Okay, so I received my third CPU, which is a 1744SUS this time, and this script failed in about two minutes again with the segfault error. Given that this is the third post-week 25 chip I've had fail, I'm pretty convinced at this point that something else is going on: either something with the way this script runs on my machine (some weird thing when using zram?), memory settings, etc. I did experience some random lockups in both Windows and Linux without any MCE or BSODs with my first chip, so I definitely think that one had something wrong.

At this point I'm going to try running some real-world workloads and see if I can reproduce it. If so I'll dig deeper into the motherboard & RAM. One other interesting note is that AMD told me "please update your motherboard BIOS to the latest version with AGESA 1.0.0.6b after installing the CPU", which I have, but it does tell me that something about the BIOS could be impacting this. I've been on 3.30 but I'll try a few other versions.

jstarcher avatar Feb 26 '18 17:02 jstarcher

A short update on the segfault saga: I've determined that disabling ASLR does indeed work around the segfault issue, or at least makes it so I can't reproduce it with this script. I'm not sure I want to leave it disabled though, as it is a small security risk running without it.
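For anyone wanting to check which ASLR mode a machine is in before and after trying this workaround, here is a small Python sketch. It reads the same `/proc/sys/kernel/randomize_va_space` file that the test output earlier in this thread prints (where it showed `2`); the function and dictionary names are illustrative.

```python
from pathlib import Path

# Meaning of kernel.randomize_va_space as described in the Linux sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "conservative randomization (stack, mmap base, VDSO)",
    2: "full randomization (also randomizes the heap/brk)",
}


def aslr_mode(path="/proc/sys/kernel/randomize_va_space"):
    """Read the current ASLR setting and return (value, description)."""
    value = int(Path(path).read_text().strip())
    return value, ASLR_MODES.get(value, "unknown")
```

Switching it off until the next reboot is done as root with `sysctl -w kernel.randomize_va_space=0`. As noted above, that trades away a security feature, so it is better seen as a diagnostic step than as a fix.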

I also received my RMA replacement from AMD today and it's a 1733SUS. Funny thing is it seems to be very common to get this batch number when you RMA it for this issue so perhaps it's a "known good" batch. I'll get it installed this week and run some tests.

jstarcher avatar Mar 07 '18 20:03 jstarcher

Finally an end to all this. I was able to complete 12 hours of the ryzen test without any issues using the 1733SUS that AMD sent as a replacement.

It's very peculiar that 3/3 of the retail purchases were bugged but AMD sent me a non-bugged item. I also notice MANY people are getting the 1733SUS back as a replacement. It makes me wonder if this is some sort of golden batch that is known to work and AMD kept them to use for replacements. Meanwhile the other CPUs on the shelf are most likely bugged regardless of the week number, at least in my painful experience.

So the answer is yes, this test is still valid. Thanks for putting this together and shame on AMD for selling known bugged CPUs!

jstarcher avatar Mar 21 '18 12:03 jstarcher

Thank you for sharing your story. It is unbelievable that they get away with this...

m-r-s avatar Mar 21 '18 22:03 m-r-s

I will re-open this issue until it finally is no issue anymore...

suaefar avatar Mar 22 '18 06:03 suaefar

I'm witnessing something interesting.

When I run kill-ryzen.sh with no parameters, it runs a lot longer before failing than kill-ryzen.sh 4 4, which fails in under two minutes. Is it just how the processor is being stressed that causes the difference in the rate of failure? I'm waiting until the 2700X has been out for a while before buying it as my replacement, so that others can test it and confirm the segfault bug doesn't exist in the refresh.

disturbednny avatar Apr 14 '18 20:04 disturbednny

@disturbednny : what is the exact error you are hitting? Running 4 X 4 means you have four loops with 4 threads each; without any parameter, you are running as many loops as there are threads on your CPU. Each loop will take longer to compile GCC. However, the stress will be similar or a bit higher with the latter. If it takes more time to fail with no parameters, that could indicate a problem with the compilation itself, not with the CPU.
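To make the loops-versus-threads arithmetic concrete, here is a hedged Python sketch. The function name is hypothetical, and the default of one thread per loop when no parameters are given is an assumption that merely fits the "Using 16 parallel processes" line in the log posted earlier; it is not taken from the script itself.

```python
import os


def plan_run(loops=None, threads_per_loop=None, cpu_threads=None):
    """Return (loops, threads_per_loop, total_jobs) for a run.

    Assumption based on the explanation above: with no parameters the
    script starts one loop per hardware thread. One thread per loop in
    that case is a guess, not something confirmed by the script.
    """
    if cpu_threads is None:
        cpu_threads = os.cpu_count() or 1
    if loops is None:
        loops = cpu_threads
    if threads_per_loop is None:
        threads_per_loop = 1
    return loops, threads_per_loop, loops * threads_per_loop
```

On a 16-thread CPU, `plan_run(4, 4, 16)` and `plan_run(cpu_threads=16)` both end up with 16 jobs in total; what differs is how many independent GCC builds run side by side, and presumably how much RAM they need.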

Oxalin avatar Apr 15 '18 06:04 Oxalin

I'll have to run them again to get the segfault errors, but they are kernel segfault checks that show up when I type dmesg, and follow the error format in the log entries jstarcher posted with the line starting with make[5]

disturbednny avatar Apr 15 '18 21:04 disturbednny

Here's the dmesg output:
[KERN] Apr 15 22:40:00 ubuntu kernel: traps: bash[12803] general protection ip:435bc4 sp:7ffe2774fec0 error:0
[KERN] Apr 15 22:40:05 ubuntu kernel: bash[18352]: segfault at 6e61c4 ip 000000000043d790 sp 00007ffe54c53900 error 6 in bash[400000+100000]

loop-2 log:
make[5]: *** [rint.lo] Segmentation fault (core dumped)

loop-0 log:
Makefile:761: recipe for target 'set_ui.lo' failed
make[5]: *** [set_ui.lo] Segmentation fault (core dumped)
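Kernel messages like the two above can be collected programmatically when comparing runs. The following Python sketch handles only the two message shapes quoted in this thread ("traps: ... general protection" and "...: segfault at ..."); the regular expressions and the function name are illustrative and other kernel versions may format these lines differently.

```python
import re

# "traps: bash[12803] general protection ip:435bc4 sp:... error:0"
GP_RE = re.compile(r"traps: ([\w.+-]+)\[(\d+)\] general protection ip:([0-9a-f]+)")
# "bash[18352]: segfault at 6e61c4 ip 000000000043d790 sp ... error 6 in ..."
SEGV_RE = re.compile(r"([\w.+-]+)\[(\d+)\]: segfault at ([0-9a-f]+) ip ([0-9a-f]+)")


def kernel_faults(log_text):
    """Return (kind, process, pid, instruction_pointer) tuples found in kernel log text."""
    faults = []
    for line in log_text.splitlines():
        m = GP_RE.search(line)
        if m:
            faults.append(("general protection", m.group(1), int(m.group(2)), int(m.group(3), 16)))
            continue
        m = SEGV_RE.search(line)
        if m:
            faults.append(("segfault", m.group(1), int(m.group(2)), int(m.group(4), 16)))
    return faults
```

In the reports in this thread the faulting process is `bash` at a different instruction pointer each time, which is the kind of scatter such a summary makes easy to see.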

disturbednny avatar Apr 15 '18 22:04 disturbednny

This looks like you got a faulty Ryzen :( I wonder how many are still out there producing erroneous results every day...

m-r-s avatar Apr 17 '18 06:04 m-r-s

Probably many. People still report faulty CPUs as of week 48 in 2017: UA 1748PGS (https://community.amd.com/message/2857007#comment-2857007)
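The batch codes quoted throughout this thread can be split mechanically. This Python sketch follows the reading used by posters here (two digits of year, two digits of week, three-letter suffix, optional "UA" prefix); that reading is the thread's convention rather than anything AMD has documented, and the function name is made up.

```python
import re

# Codes in this thread look like "UA 1743SUT", "UA1733PGS" or plain "1733PGS".
BATCH_RE = re.compile(r"(?:UA\s*)?(\d{2})(\d{2})([A-Z]{3})")


def parse_batch(code):
    """Split a batch code into (year, week, suffix), or return None if it does not match."""
    m = BATCH_RE.fullmatch(code.strip())
    if not m:
        return None
    return 2000 + int(m.group(1)), int(m.group(2)), m.group(3)
```

As the reports here show, the week number alone does not separate good chips from bad ones, so this is only useful for bookkeeping.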

suaefar avatar Apr 27 '18 09:04 suaefar

Some good news:

I received my R7 2700X this past Saturday, and successfully ran the kill-ryzen script for 8 hours straight with no segfault. So it looks like it is not present in the R7 and R5 2000 series. RMA'd my 1700 after installing the 2700X so we'll see what they give me.

disturbednny avatar Apr 27 '18 13:04 disturbednny

That's good news! I was really hoping that they would get it under control eventually.

m-r-s avatar Apr 27 '18 14:04 m-r-s

Thank you very much for providing this test! A few days ago, I got a Ryzen 5 1600 (lot 1743SUS), which failed the test in under 3 minutes. The dealer was kind enough to take it back and let me order a Ryzen 5 2600 (lot 1806SUT) as a replacement, which seems to work just fine.

7Z0t99 avatar May 12 '18 14:05 7Z0t99

Hi. So to be clear, is disabling ASLR the answer to some of these issues? I've had a Ryzen 1700 1733PGS since Oct 2017 and the thing has been nothing but trouble. Dealing with this https://bugzilla.kernel.org/show_bug.cgi?id=196683 in addition to the general protection faults.

I am running latest AGESA. My memory has been tested ok. Basically I can reproduce a general protection fault very easily by just running something that uses several threads. For instance, using Saltstack config management commands I could repro a fault just about every time I ran a somewhat intensive job with ASLR on. With ASLR off I get no protection faults.

Thanks!

infoveinx avatar Jun 18 '18 17:06 infoveinx

@infoveinx : short answer is we don't know. As long as AMD won't recognize and disclose the problem, we can't tell for sure.

Oxalin avatar Jun 18 '18 17:06 Oxalin

I've been building computers since the 1990s, and many of those have been specifically for running Linux as either a desktop or server for personal use. In all of those years I've never encountered the kind of problems I've seen with this CPU. I've used both AMD and Intel too. I feel like 2017/2018 have been the worst given these issues, not to mention things like Spectre/Meltdown muddying the waters even more.

AMD needs to open up about this issue, because it's quite obvious that there are real problems with this generation of processors. I know I'm preaching to the choir here. If you look at the link I posted above way down you'll see where folks have gone to several of the top techie news sites and reported some of the issues with these processors. They either get no response, or a response that states they aren't having the issues reported. I can't imagine this is the case with so many people reporting the same problems. It feels like one giant coverup if you ask me that even involves news and tech sites that test and write reviews on hardware.

infoveinx avatar Jun 18 '18 18:06 infoveinx

Basically AMD needs to release more info, because I don't think we fully understand this issue. Hopefully the 2700s are fixed, as you say.

My UA1733PGS has been running with no issues since my other thread and upped voltages.

protox avatar Jun 27 '18 09:06 protox

Increased voltages mean increased power consumption, more heat, higher temperatures and possibly lower performance... it is a workaround but no fix. Nobody should need to touch the stock voltages to get a stable system.

m-r-s avatar Jun 27 '18 09:06 m-r-s

Yes there is obviously an inherent issue, wouldn't be surprised if it's a design flaw in the end.

protox avatar Jun 27 '18 10:06 protox

suaefar, thanks for the effort from your side to help isolate and reproduce the problem.

I've got a 1700 and three 1800X CPUs. I bought my 1700 in March 2017, I expected problems with new micro architecture as we have seen in the past. I was surprised when I found my 1800X CPUs are from the first week of production even though I bought them individually in December 2017 to January 2018.

I tried to RMA one (1707) in April 2018. I can't afford to stop using all of them and I wasn't sure if replacements would work. AMD accepted my request (thanks to your "ryzen-test"), however when it came to shipping address I found out that AMD does not support my country. It looks like I am in it for the long run.

Until a month ago I did not have so many problems, but since I started using newer kernels (>= 4.15) I've seen multiple crashes per day, resulting in filesystem corruption beyond the point that fsck will repair. I've tried various distributions/kernels, CPU pinning, and hugepages to try and isolate workloads in virtual machines. Thus far my best was 188 days uptime on my 1700 using pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve), which is based on Debian 9.4. A strange observation is that my machine with the most memory did the best; it has 4x16GB RAM.

I know this project aims to determine if your hardware is faulty or not. I'm begging for any advice on a workaround that would make my system(s) stable without spending a huge sum of money i.e. buying new CPUs or Windows license for each machine. Can we pressure AMD to assist kernel devs or provide more information on what they changed between chip revisions?

skarr avatar Jul 20 '18 23:07 skarr

Hi skarr,

Disabling ASLR, µop caching, and SMT has on some occasions been reported to increase stability. Depending on your workload, it might also help to pin processes to certain CPUs (with "taskset"), but I am not sure.
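For pinning from within a program rather than with the `taskset` command, Python exposes the same Linux scheduler call. A minimal, Linux-only sketch (the helper name is made up); whether pinning actually helps with this bug is unconfirmed, as said above.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the calling process) to the given CPU numbers.

    Linux only; roughly what `taskset -p` does for an existing process.
    Returns the affinity set that is in effect afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)
```

Child processes started afterwards inherit the affinity, so calling `pin_to_cpus({0, 1})` before launching a workload keeps the whole process tree on those two CPUs.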

Unfortunately, we know close to nothing because AMD never shared any bit of information on this issue with us. You bought a product which does not work as expected. I would simply return it.

suaefar avatar Jul 21 '18 06:07 suaefar

Thanks for the advice, I really appreciate it.

I found a product errata document by AMD in this post. This one stood out to me:

1109 MWAIT Instruction May Hang a Thread

Description: Under a highly specific and detailed set of internal timing conditions, the MWAIT instruction may cause a thread to hang in SMT (Simultaneous Multithreading) Mode.
Potential Effect on System: The system may hang or reset.
Suggested Workaround: System software may contain the workaround for this erratum.
Fix Planned: No fix planned

I also found new responses in Kernel.org Bugzilla stating idle=nomwait fixed all hangs. I am in the process of testing this for myself. My long term strategy is to try these/other workarounds while I continue the fight with the local suppliers. Looks like reddit users are happy with the RMA process, which is not helping my case. Now, as many people obtain newer CPUs, it looks like this issue is being swept under the carpet: "nothing to see here, please disperse".
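When testing a boot parameter such as `idle=nomwait`, it is worth confirming that the running kernel was really booted with it. A small Python sketch that parses a kernel command line string, such as the contents of `/proc/cmdline`; the function names are illustrative and quoted values containing spaces are not handled.

```python
def kernel_params(cmdline):
    """Parse a kernel command line string into a dict (bare flags map to None)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def uses_nomwait(cmdline):
    """True if the command line contains idle=nomwait."""
    return kernel_params(cmdline).get("idle") == "nomwait"
```

On a live system this would be called as `uses_nomwait(open("/proc/cmdline").read())`.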

I will create a new issue/update this one if I find anything useful.

skarr avatar Jul 21 '18 19:07 skarr

@skarr have a look at this project: https://github.com/qrwteyrutiyoup/ryzen-stabilizator

Disabling C6 and ASLR, and enabling the power supply idle workaround, helped me. Without these even my replacement “not bugged” CPU had random reboots and soft lockups on Ubuntu. I created a systemd startup unit to make these changes automatically.

TL;DR there’s other problems with first gen Ryzen in Linux outside of this compilation bug :(

jstarcher avatar Jul 26 '18 04:07 jstarcher

Finally did the RMA. They sent me a UA1733SUS; the one I sent back was a UA1733PGS. The new one is also not behaving. I reset the BIOS and I'm also running the latest BIOS. The only change I made under the advanced CPU section was to enable the typical current idle option and SVM. I run about 5 qemu-kvm VMs. Basically the ones that actually do stuff are crashing with kernel panics randomly (just like before). If you let one sit long enough in panic state without killing it, the host system will eventually have some kind of kernel issue and lock up. Generally the system is close to idle though, as the VMs don't have much activity.

To be clear I'm now on a 4.17 kernel from Debian Stretch backports. I've used 4.12, 4.13, 4.14, 4.15, and 4.16 prior. All of them unstable (although once upon a time 4.13 had a long uptime) but that was after disabling C-states both in bios and in software, and disabling ASLR. Subsequent kernel versions didn't seem to make a difference even with all of that disabled. Even with 4.13 every so often I would see a VM go to 100% and have to be restarted, though much less frequent. I also had a much older BIOS at that time.

During the RMA process I moved all of the VM images back to an old 2012 Intel i3 that I had used prior. Not a single problem from that system in the week that I ran it and at times under heavy load. I was going to try a bunch of stuff like, CPU pinning with VMs etc, but I've read other folks tried that and it still crashed. I'm not going to continue trying to make this work. I'm just not going to buy AMD ever again. In fact if I must I will purchase older gen processors after doing research to ensure they can handle running VMs under Linux.

This has been a fight since November 2017 and I've lost countless hours. If anyone has suggestions I'm open to them, but at this point it seems like a lost fight and time to move on.

infoveinx avatar Aug 03 '18 15:08 infoveinx

Update. The initial crashes that I encountered with the replacement CPU were still under 4.16 kernel. The only incident I had with 4.17 kernel was starting the 5 linux VMs simultaneously and then one of them crashed not long after startup.

Something else I did was disable IOMMU in bios almost two days ago and I can't quite recall if I did this prior to the single 4.17 kernel incident. I let it sit mostly idle for a little over one day and didn't experience a crash or idle lockup. Today I tried to use every method prior to crash it and it never experienced a single hiccup. I'm not sure what to make of it yet so going to let it go longer and see what happens. Unfortunately in the past I've seen it crash anywhere from within minutes, to multiple days.

infoveinx avatar Aug 05 '18 00:08 infoveinx

Latest update. System went 7 days without issue but is now back to being completely unstable. During those 7 days I ran it through a gambit of things from normal tasks, to many simultaneous things involving network I/O, disk both SATA and USB 3.0 transfers along with some stress-ng runs. I had 6 Linux vms running on it and it never hinted a single issue. I find it on 7th day locked up with a kernel panic. Since then it is completely unstable, with or without vms running. It's very hard for me to understand how it can run so perfect and the suddenly become so unstable with no changes.

Some thoughts on this. I would assume that if the mobo were bad, the behavior would have shown well before 7 days. The same goes for the PSU. I did run the RAM through an 8 hour memtest some time back and saw no problems. The only conclusion I can come to here is that the Linux kernel itself is just not working well with this CPU for whatever reason. The other thing I wondered is whether something about the mobo is providing incorrect voltages and over time is degrading the CPU in some way.

I can't really deal with this any longer, so I think for now the system will just get shelved and replaced with a previous-gen Intel. I've been reading around again and I see several folks who are essentially dealing with the same kind of conditions, i.e., stable then completely unstable, with general protection faults/segfaults, basic system lockups, etc. Yes, it does seem like many people who RMA are getting 2017 Week 33 replacements.

infoveinx avatar Aug 11 '18 16:08 infoveinx

Have you tried everything outlined here: http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

These changes seemed to really help me. At this point the only thing I get is an occasional soft lockup, which I’m 99% sure is an Nvidia driver issue. I can ssh to the machine, Xorg is locked up, and I have errors in the Xorg log, but I haven’t seen evidence of CPU instability. Also try with completely stock RAM settings, not the XMP profile. XMP isn’t guaranteed to be stable. It's worth upping the SoC voltage to 1.1 V as well, if you haven’t yet.

jstarcher avatar Aug 11 '18 16:08 jstarcher

Yeah, thanks. I've tried all of those things and then some. The longest run I had was on kernel 4.13 (I'm running Debian 9.x so I'm using backports to get newer kernels). Basically it came down to disable C-states in bios, disable the remaining states via the Zenstates script, disable ASLR, and finally blacklisting nouveau driver. I went over 100 days uptime with that, however I still had the occasional VM lockup when I would do a heavy file transfer over network on the host system itself (not a VM on said system). Since then I've updated to latest bios and iterated over kernels 4.14, 4.15, 4.16, and now on 4.17.
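For anyone retracing these steps, it helps to confirm after a reboot that ASLR really is off. The following is only a minimal sketch, assuming a Linux host: it reads the `randomize_va_space` sysctl, and the helper name `describe_aslr` is made up for illustration. The meanings of 0, 1 and 2 follow the kernel's sysctl documentation.

```python
from pathlib import Path

# Location of the ASLR sysctl on Linux.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")


def describe_aslr(raw: str) -> str:
    """Map the sysctl value to a readable state (hypothetical helper)."""
    states = {
        "0": "disabled",
        "1": "partial (mmap base, stack and VDSO randomized)",
        "2": "full (heap randomized as well)",
    }
    value = raw.strip()
    return states.get(value, f"unknown value {value!r}")


if __name__ == "__main__":
    if ASLR_SYSCTL.exists():
        print("ASLR:", describe_aslr(ASLR_SYSCTL.read_text()))
    else:
        print("No ASLR sysctl found; is this a Linux host?")
```

This only reports the current state; it does not change the setting and says nothing about C-states or the blacklisted driver.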

I have also tried not doing XMP with no change in results. Running XMP is showing the RAM rated timings in bios fwiw. That 100+ days of uptime was also on the bios that came with the mobo, which was quite old and prior to the addition of the power supply configurable idle states that AMD added. I haven't had a soft lockup in a very long time; my issues all appear to be related to memory now. I also tried to disable SMT, OpCache, etc. In the end the only way to keep it working was to pass maxcpus=1 to the kernel so that only a single CPU was used. In that case I was able to copy files across the network (Samba) and back and forth to a USB 3.0 drive (as backup) with no crashes.
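To double-check that a boot parameter such as `maxcpus=1` actually reached the kernel, one option is to read `/proc/cmdline` and compare it with the CPU count the OS reports. This is a rough sketch, assuming a Linux host; `parse_cmdline` is a hypothetical helper and it does not handle quoted values that contain spaces.

```python
import os
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        params = parse_cmdline(proc_cmdline.read_text())
        print("maxcpus =", params.get("maxcpus", "not set"))
    # Number of logical CPUs the OS exposes to this process's view.
    print("CPUs visible to Python:", os.cpu_count())
```

With `maxcpus=1` in effect, the second line would be expected to report a single CPU.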

I'm not running Xorg or any GUI on this system, but I do have a Geforce 210 as the video card. I've yet to see kernel errors related to video. I tried adding voltage slowly and that seemed to increase problems (on the returned CPU), but I am willing to try it again with new CPU.

For clarity here are details of my system.

  • Gigabyte AB350-Gaming 3, Bios F23d
  • CORSAIR CX-M Series CX550M 550W PSU
  • Ryzen 7 1700
  • G.SKILL Flare X Series 32GB (4 x 8GB) 288-Pin DDR4 SDRAM DDR4 2400
  • Geforce 210 video card
  • Intel EXPI9301CTBLK Network Adapter 10/100/1000Mbps PCI-Express
  • SAMSUNG 850 PRO 512GB SSD
  • HGST Deskstar NAS 3.5" 10TB x2 running in Raid 1
  • Debian 9.5, kernel 4.17, running on the Samsung SSD

infoveinx avatar Aug 11 '18 17:08 infoveinx

Also noticing these in boot log. Don't recall seeing them prior. I know there has been talk of fixes related to this in newer kernels. No idea how it relates but as stated prior I'm already on 4.17 kernel.

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
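The message above appears sixteen times, which matches the 16 hardware threads of a Ryzen 7 1700, so it is presumably logged once per logical CPU (that reading is an assumption, not something the log states). When sharing boot logs, collapsing identical lines makes them easier to read. A minimal sketch; `summarize_messages` is a hypothetical helper and the sample list simply repeats the line from this post.

```python
from collections import Counter


def summarize_messages(lines):
    """Return (message, count) pairs, most frequent first; blank lines are skipped."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


if __name__ == "__main__":
    # Sample input: the line from this post, repeated as in the boot log.
    sample = ["[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)"] * 16
    for message, count in summarize_messages(sample):
        print(f"{count:3d}x {message}")
```

In practice the input could be the saved output of `dmesg`, read line by line.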

infoveinx avatar Aug 11 '18 17:08 infoveinx

Did the following.

  • Reset bios to optimized defaults
  • Turned on SVM
  • Set PSU to typical current
  • Made sure XMP Profile disabled
  • Set VCORE SOC to 1.116 (fluctuates from 1.104 - 1.128)
  • Disabled IOMMU
  • CSM mode with UEFI for boot devices

The system is still randomly unstable. I had a hard-locked CPU related to KVM, and random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes. It still boggles my mind that it went 7 days with no issue running all kinds of tasks.

No VMs were running; I got these when I went to copy the qcow2 images to the Raid 1 to start the backup and rebuild process with another system.

page:fffff89a1dd21880 count:0 mapcount:0 mapping:0000000000000f00 index:0x1

I rebooted, set SOC back to Auto, and then copied the 67G worth of qcow2 images to the Raid 1 with no problem. Maybe there is just some kind of voltage regulator problem here; I'm not sure. I'll mess with it on the side while I have the hopefully stable replacement up.

I put a load of 21 on it last night via converting some h.265 to h.264 video with ffmpeg, all while running other things in a loop to try to break it, and of course it had zero problems. Running VMs though is a matter of time (and much shorter time lately).

infoveinx avatar Aug 14 '18 23:08 infoveinx

> random segfaults may or may not happen after each reboot while trying to run a command that I know generates them sometimes

This sounds familiar... My advice: "If it does not run stable with stock settings, save the time and RMA it."

m-r-s avatar Aug 15 '18 09:08 m-r-s

> This sounds familiar... My advice: "If it does not run stable with stock settings, save the time and RMA it."

Unfortunately this is with the RMA CPU. I sent in a 1733PGS and got a 1733SUS back.
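Going by how codes are read elsewhere in this thread (a "1743SUT" chip described as week 43, and "2017 Week 33" replacements), the first two digits look like the year and the next two the week. A small sketch under that assumption; `parse_date_code` and `DateCode` are names made up here, and the meaning of the trailing letters is not decoded.

```python
import re
from typing import NamedTuple


class DateCode(NamedTuple):
    year: int
    week: int
    suffix: str


def parse_date_code(code: str) -> DateCode:
    """Split a code like '1733SUS' into year, week and the trailing letters."""
    match = re.fullmatch(r"(\d{2})(\d{2})([A-Z]+)", code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized date code: {code!r}")
    year, week, suffix = match.groups()
    # Assumes a 20xx year, which fits the 2017/2018 parts discussed here.
    return DateCode(2000 + int(year), int(week), suffix)


if __name__ == "__main__":
    for code in ("1733PGS", "1733SUS"):
        print(code, "->", parse_date_code(code))
```

Both codes from this post come out as 2017, week 33, differing only in the suffix.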

infoveinx avatar Aug 15 '18 18:08 infoveinx

I agree, might be time to RMA the motherboard and/or RAM. Maybe try one stick of ram at a time to try to isolate if you have a bad stick.

jstarcher avatar Aug 15 '18 18:08 jstarcher

> There is no official information on which CPUs are affected (or not affected). Your description here does fit the ryzen segfault bug. Prime95 and Memtest86 are not (as) sensitive to the bug as this workload. If you hit a segfault that rapidly (less than 5 minutes) and only one or a few processes fail (and some continue running), then your CPU is probably affected. If all processes fail within a short period it may be due to another problem.
>
> You can still check the build logs in /mnt/ramdisk/workdir for problems other than a faulty CPU.
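The rule of thumb quoted above can be written down as a small decision function. This is only an illustration of that quoted heuristic, not part of the test script; the function name, its parameters and the returned labels are all hypothetical, and the 300 second cutoff is the "less than 5 minutes" from the quote.

```python
def classify_run(total_workers: int, failed_workers: int, first_failure_s: float) -> str:
    """Apply the quoted rule of thumb to one test run (hypothetical helper)."""
    if failed_workers == 0:
        return "no failure observed"
    if failed_workers >= total_workers:
        # Everything failing at once points away from the segfault bug.
        return "all failed: possibly another problem"
    if first_failure_s < 300:
        # A few workers segfault quickly while the rest keep running.
        return "probably affected"
    return "not covered by the rule of thumb"


if __name__ == "__main__":
    # Example: one of 16 build loops fails after about 200 seconds.
    print(classify_run(total_workers=16, failed_workers=1, first_failure_s=200))
```

Whatever the outcome, the build logs are still worth checking for problems other than a faulty CPU.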

Hello, just a quick question. Can memtest86 be influenced by this bug? Can memtest86 trigger this bug and make the RAM look faulty?

v0idwalker avatar Nov 08 '18 00:11 v0idwalker

Theoretically, yes. But one of the particular observations was that memtest86 ran fine on the faulty CPUs while the compilation of GCC failed.

m-r-s avatar Nov 08 '18 09:11 m-r-s

Same experience here. Memtest ran overnight without finding any errors. This bug requires heavy CPU usage across all threads to trigger, which memtest doesn’t do.

That isn’t to say it is impossible for the bug to make memtest fail, though.

jstarcher avatar Nov 08 '18 13:11 jstarcher

Well, I am sending my 1700X in for RMA. Meanwhile I borrowed a 2600 and will check whether the problem persists. Was this bug observed on Zen+ too? (I have an ASRock Taichi X470, so there should be no problem with compatibility.)

Also, what is the expected final step of this script?

v0idwalker avatar Nov 08 '18 14:11 v0idwalker