hs-cipher-aes

segfault with AES-NI

Open snakamura opened this issue 10 years ago • 26 comments

I'm writing a small utility that uploads a number of files to S3 using the aws package, and I found that it crashes with a segfault after making a lot of HTTP requests. Looking at the stack trace, I figured out that it crashes in aes_gen_ctr.

(gdb) where
#0  0x000000000096495e in aes_gen_ctr ()
#1  0x000000000095c609 in c7ZG_info ()
#2  0x0000000000000000 in ?? ()

When I turned off AES-NI by editing cipher-aes.cabal, rebuilt the whole program, and ran it again, it worked fine.

I'm not sure whether this is related to cipher-aes at all; I actually suspected it might be something in GHC itself, but I have no idea how to investigate further. Do you have any ideas?

I built and ran it with ghc-7.6.3 and ghc-7.8.2 on Ubuntu 12.04 LTS running on an EC2 instance which supports AES-NI.

snakamura avatar Jun 02 '14 05:06 snakamura

To be clear, your EC2 instance shows aes in /proc/cpuinfo? Does gdb show the segfault occurring at an AES-NI instruction?

TomMD avatar Jun 02 '14 17:06 TomMD

Yes, it shows aes. Here is an excerpt from /proc/cpuinfo.

flags           : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc up rep_good nopl nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat epb xsaveopt pln pts dtherm

I don't recall which instruction gdb showed at the segfault. I accidentally overwrote the binary that caused the segfault, so I can't check it now. Let me try to reproduce it.

snakamura avatar Jun 03 '14 04:06 snakamura

@snakamura next time you get a stack trace, could you also do a 'disassemble' of the offending function? It would be a great help for debugging.

For now, I'm not sure what the reason for this segfault is. Is the EC2 instance running on an AMD CPU, maybe? I haven't tested the AES-NI instructions on AMD, for example, which could explain this. However, a failure that is not consistently reproducible suggests something is quite fishy.

vincenthz avatar Jun 03 '14 12:06 vincenthz

Here is the disassembly at the segfault. It looks like it has nothing to do with AES-NI itself.

   +---------------------------------------------------------------------------+
   |0x96395d <aes_gen_ctr+13>       mov    %rsi,%rbp                           |
   |0x963960 <aes_gen_ctr+16>       push   %rbx                                |
   |0x963961 <aes_gen_ctr+17>       mov    %rdi,%rbx                           |
   |0x963964 <aes_gen_ctr+20>       sub    $0x10,%rsp                          |
   |0x963968 <aes_gen_ctr+24>       mov    (%rdx),%rax                         |
   |0x96396b <aes_gen_ctr+27>       mov    %rax,(%rsp)                         |
   |0x96396f <aes_gen_ctr+31>       mov    0x8(%rdx),%rax                      |
   |0x963973 <aes_gen_ctr+35>       mov    %rax,0x8(%rsp)                      |
   |0x963978 <aes_gen_ctr+40>       cmp    %r12,%rbx                           |
   |0x96397b <aes_gen_ctr+43>       je     0x9639b0 <aes_gen_ctr+96>           |
   |0x96397d <aes_gen_ctr+45>       movzbl 0x1(%rbp),%eax                      |
   |0x963981 <aes_gen_ctr+49>       mov    %rbx,%rdi                           |
   |0x963984 <aes_gen_ctr+52>       mov    %rsp,%rdx                           |
   |0x963987 <aes_gen_ctr+55>       mov    %rbp,%rsi                           |
   |0x96398a <aes_gen_ctr+58>       add    $0x10,%rbx                          |
  >|0x96398e <aes_gen_ctr+62>       callq  *0x18cb318(,%rax,8)                 |
   |0x963995 <aes_gen_ctr+69>       mov    0x8(%rsp),%rax                      |
   |0x96399a <aes_gen_ctr+74>       bswap  %rax                                |
   |0x96399d <aes_gen_ctr+77>       add    $0x1,%rax                           |
   |0x9639a1 <aes_gen_ctr+81>       je     0x9639c0 <aes_gen_ctr+112>          |
   |0x9639a3 <aes_gen_ctr+83>       bswap  %rax                                |
   |0x9639a6 <aes_gen_ctr+86>       cmp    %r12,%rbx                           |
   |0x9639a9 <aes_gen_ctr+89>       mov    %rax,0x8(%rsp)                      |
   |0x9639ae <aes_gen_ctr+94>       jne    0x96397d <aes_gen_ctr+45>           |
   |0x9639b0 <aes_gen_ctr+96>       add    $0x10,%rsp                          |
   |0x9639b4 <aes_gen_ctr+100>      pop    %rbx                                |
   |0x9639b5 <aes_gen_ctr+101>      pop    %rbp                                |
   |0x9639b6 <aes_gen_ctr+102>      pop    %r12                                |
   |0x9639b8 <aes_gen_ctr+104>      retq                                       |
   |0x9639b9 <aes_gen_ctr+105>      nopl   0x0(%rax)                           |
   |0x9639c0 <aes_gen_ctr+112>      mov    (%rsp),%rax                         |
   |0x9639c4 <aes_gen_ctr+116>      movq   $0x0,0x8(%rsp)                      |
   |0x9639cd <aes_gen_ctr+125>      bswap  %rax                                |
   |0x9639d0 <aes_gen_ctr+128>      add    $0x1,%rax                           |
   |0x9639d4 <aes_gen_ctr+132>      bswap  %rax                                |
   +---------------------------------------------------------------------------+
multi-thre Thread 0x7fa2d In: aes_gen_ctr               Line: ??   PC: 0x96398e 

And here is the output of cat /proc/cpuinfo.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping        : 7
microcode       : 0x70a
cpu MHz         : 1799.999
cache size      : 20480 KB
physical id     : 1
siblings        : 1
core id         : 1
cpu cores       : 1
apicid          : 34
initial apicid  : 34
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc up rep_good nopl nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat epb xsaveopt pln pts dtherm
bogomips        : 3599.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

snakamura avatar Jun 04 '14 00:06 snakamura

It appears as if the branch table has been corrupted or badly initialized; I'm not sure what would cause this. Do you have the segfault information from dmesg? It looks something like:

segfault at 11 ip 00007f24b21adcee sp 00007fff7379ded0 error 4 in lib.so[7f24b197a000+f6f000]

vincenthz avatar Jun 04 '14 08:06 vincenthz

Yes, it says:

[33773307.954570] storefiles[20603] general protection ip:96398e sp:7fa2d8cfade0 error:0 in storefiles[400000+1228000]

snakamura avatar Jun 04 '14 08:06 snakamura

Just wanted to say that we have also run into this problem in prod when making lots and lots of AWS-related HTTPS requests per second. Our gdb output is very similar; I'm pasting the information below:

Backtrace:

(gdb) bt
#0  0x0000000003864c83 in aes_gen_ctr_cont ()
#1  0x000000000385d529 in ?? ()
#2  0x0000000000000000 in ?? ()

Disassemble:

(gdb) disassemble
Dump of assembler code for function aes_gen_ctr_cont:
   0x0000000003864c40 <+0>:     push   %r13
   0x0000000003864c42 <+2>:     mov    %ecx,%r13d
   0x0000000003864c45 <+5>:     shl    $0x4,%r13
   0x0000000003864c49 <+9>:     push   %r12
   0x0000000003864c4b <+11>:    add    %rdi,%r13
   0x0000000003864c4e <+14>:    mov    %rdx,%r12
   0x0000000003864c51 <+17>:    push   %rbp
   0x0000000003864c52 <+18>:    mov    %rsi,%rbp
   0x0000000003864c55 <+21>:    push   %rbx
   0x0000000003864c56 <+22>:    mov    %rdi,%rbx
   0x0000000003864c59 <+25>:    sub    $0x18,%rsp
   0x0000000003864c5d <+29>:    mov    (%rdx),%rax
   0x0000000003864c60 <+32>:    mov    %rax,(%rsp)
   0x0000000003864c64 <+36>:    mov    0x8(%rdx),%rax
   0x0000000003864c68 <+40>:    mov    %rax,0x8(%rsp)
   0x0000000003864c6d <+45>:    cmp    %r13,%rbx
   0x0000000003864c70 <+48>:    je     0x3864ca5 <aes_gen_ctr_cont+101>
   0x0000000003864c72 <+50>:    movzbl 0x1(%rbp),%eax
   0x0000000003864c76 <+54>:    mov    %rbx,%rdi
   0x0000000003864c79 <+57>:    mov    %rsp,%rdx
   0x0000000003864c7c <+60>:    mov    %rbp,%rsi
   0x0000000003864c7f <+63>:    add    $0x10,%rbx
=> 0x0000000003864c83 <+67>:    callq  *0x5d6e678(,%rax,8)
   0x0000000003864c8a <+74>:    mov    0x8(%rsp),%rax
   0x0000000003864c8f <+79>:    bswap  %rax
   0x0000000003864c92 <+82>:    add    $0x1,%rax
   0x0000000003864c96 <+86>:    je     0x3864cc8 <aes_gen_ctr_cont+136>
   0x0000000003864c98 <+88>:    bswap  %rax
   0x0000000003864c9b <+91>:    cmp    %r13,%rbx
   0x0000000003864c9e <+94>:    mov    %rax,0x8(%rsp)
   0x0000000003864ca3 <+99>:    jne    0x3864c72 <aes_gen_ctr_cont+50>
   0x0000000003864ca5 <+101>:   mov    (%rsp),%rax
   0x0000000003864ca9 <+105>:   mov    %rax,(%r12)
   0x0000000003864cad <+109>:   mov    0x8(%rsp),%rax
   0x0000000003864cb2 <+114>:   mov    %rax,0x8(%r12)
   0x0000000003864cb7 <+119>:   add    $0x18,%rsp
   0x0000000003864cbb <+123>:   pop    %rbx
   0x0000000003864cbc <+124>:   pop    %rbp
   0x0000000003864cbd <+125>:   pop    %r12
   0x0000000003864cbf <+127>:   pop    %r13
   0x0000000003864cc1 <+129>:   retq
   0x0000000003864cc2 <+130>:   nopw   0x0(%rax,%rax,1)
   0x0000000003864cc8 <+136>:   mov    (%rsp),%rax
   0x0000000003864ccc <+140>:   movq   $0x0,0x8(%rsp)
   0x0000000003864cd5 <+149>:   bswap  %rax
   0x0000000003864cd8 <+152>:   add    $0x1,%rax
   0x0000000003864cdc <+156>:   bswap  %rax
   0x0000000003864cdf <+159>:   mov    %rax,(%rsp)
   0x0000000003864ce3 <+163>:   jmp    0x3864c6d <aes_gen_ctr_cont+45>
End of assembler dump.

cpuinfo:

model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping        : 4
microcode       : 0x416
cpu MHz         : 2800.050
cache size      : 25600 KB
physical id     : 1
siblings        : 8
core id         : 1
cpu cores       : 1
apicid          : 35
initial apicid  : 35
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase erms
bogomips        : 5600.10
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

(The entries for the remaining processors are identical apart from the processor number.)

ozataman avatar Dec 22 '14 17:12 ozataman

We also hit this in production and were able to reproduce it on our development machines, with a similar backtrace:

(gdb) bt
#0  0x0000000000fdc093 in aes_gen_ctr_cont ()
#1  0x0000000000fd4e2b in ?? ()
#2  0x00007f515fdd0ca0 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) disassemble
Dump of assembler code for function aes_gen_ctr_cont:
   0x0000000000fdc050 <+0>:     push   %r13
   0x0000000000fdc052 <+2>:     mov    %ecx,%r13d
   0x0000000000fdc055 <+5>:     shl    $0x4,%r13
   0x0000000000fdc059 <+9>:     push   %r12
   0x0000000000fdc05b <+11>:    add    %rdi,%r13
   0x0000000000fdc05e <+14>:    mov    %rdx,%r12
   0x0000000000fdc061 <+17>:    push   %rbp
   0x0000000000fdc062 <+18>:    mov    %rsi,%rbp
   0x0000000000fdc065 <+21>:    push   %rbx
   0x0000000000fdc066 <+22>:    mov    %rdi,%rbx
   0x0000000000fdc069 <+25>:    sub    $0x18,%rsp
   0x0000000000fdc06d <+29>:    mov    (%rdx),%rax
   0x0000000000fdc070 <+32>:    mov    %rax,(%rsp)
   0x0000000000fdc074 <+36>:    mov    0x8(%rdx),%rax
   0x0000000000fdc078 <+40>:    mov    %rax,0x8(%rsp)
   0x0000000000fdc07d <+45>:    cmp    %r13,%rbx
   0x0000000000fdc080 <+48>:    je     0xfdc0b5 <aes_gen_ctr_cont+101>
   0x0000000000fdc082 <+50>:    movzbl 0x1(%rbp),%eax
   0x0000000000fdc086 <+54>:    mov    %rbx,%rdi
   0x0000000000fdc089 <+57>:    mov    %rsp,%rdx
   0x0000000000fdc08c <+60>:    mov    %rbp,%rsi
   0x0000000000fdc08f <+63>:    add    $0x10,%rbx
=> 0x0000000000fdc093 <+67>:    callq  *0x232f4d8(,%rax,8)
   0x0000000000fdc09a <+74>:    mov    0x8(%rsp),%rax
   0x0000000000fdc09f <+79>:    bswap  %rax
   0x0000000000fdc0a2 <+82>:    add    $0x1,%rax
   0x0000000000fdc0a6 <+86>:    je     0xfdc0d8 <aes_gen_ctr_cont+136>
   0x0000000000fdc0a8 <+88>:    bswap  %rax
   0x0000000000fdc0ab <+91>:    cmp    %r13,%rbx
   0x0000000000fdc0ae <+94>:    mov    %rax,0x8(%rsp)
   0x0000000000fdc0b3 <+99>:    jne    0xfdc082 <aes_gen_ctr_cont+50>
   0x0000000000fdc0b5 <+101>:   mov    (%rsp),%rax
   0x0000000000fdc0b9 <+105>:   mov    %rax,(%r12)
   0x0000000000fdc0bd <+109>:   mov    0x8(%rsp),%rax
   0x0000000000fdc0c2 <+114>:   mov    %rax,0x8(%r12)
   0x0000000000fdc0c7 <+119>:   add    $0x18,%rsp
   0x0000000000fdc0cb <+123>:   pop    %rbx
   0x0000000000fdc0cc <+124>:   pop    %rbp
   0x0000000000fdc0cd <+125>:   pop    %r12
   0x0000000000fdc0cf <+127>:   pop    %r13
   0x0000000000fdc0d1 <+129>:   retq   
   0x0000000000fdc0d2 <+130>:   nopw   0x0(%rax,%rax,1)
   0x0000000000fdc0d8 <+136>:   mov    (%rsp),%rax
   0x0000000000fdc0dc <+140>:   movq   $0x0,0x8(%rsp)
   0x0000000000fdc0e5 <+149>:   bswap  %rax
   0x0000000000fdc0e8 <+152>:   add    $0x1,%rax
   0x0000000000fdc0ec <+156>:   bswap  %rax
   0x0000000000fdc0ef <+159>:   mov    %rax,(%rsp)
   0x0000000000fdc0f3 <+163>:   jmp    0xfdc07d <aes_gen_ctr_cont+45>

It's also not predictably reproducible. I reproduced it on my laptop yesterday but wasn't able to reproduce it when running overnight. In prod, it seems like we hit it on any given machine at least every two hours. It seems to happen when pushing records to Amazon Kinesis.

Is there anything we can do to try to narrow down this problem or help resolve it?

glasserc avatar Jan 09 '15 19:01 glasserc

Based on the clues I got from this thread, I wrote a program that just hammers Kinesis with a ton of threads. It crashed in 22 seconds, and I ran it again and it crashed in 5 minutes. This is much more reproducible than anything else I've done so far with this problem. Anyhow, the repository is at https://github.com/SumAll/haskell-overload-kinesis if anyone else wants to try to reproduce it.
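For reference, the hammer is roughly this shape (a sketch, not the actual repo code; the URL and thread count are placeholders). Even unauthenticated requests are enough for this purpose, since each one still performs a full TLS handshake:

import Control.Concurrent (forkIO, threadDelay)
import Control.Exception (SomeException, try)
import Control.Monad (forM_, forever)
import qualified Data.ByteString.Lazy as L
import Network.HTTP.Client (Response, httpLbs, newManager, parseUrlThrow)
import Network.HTTP.Client.TLS (tlsManagerSettings)

main :: IO ()
main = do
    mgr <- newManager tlsManagerSettings
    forM_ [1 .. 100 :: Int] $ \_ -> forkIO $ forever $ do
        req <- parseUrlThrow "https://kinesis.us-east-1.amazonaws.com/"
        -- ignore HTTP errors; we only care about hammering the TLS path
        _ <- try (httpLbs req mgr)
                 :: IO (Either SomeException (Response L.ByteString))
        return ()
    forever (threadDelay 1000000)  -- keep the main thread alive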

glasserc avatar Jan 09 '15 21:01 glasserc

I'm curious if the jump table is being corrupted or if the 'strength' field of the key structure is actually being altered somehow. Can anyone capable of running this test add in a check of the strength and print/panic on any erroneous values (> 2 || < 0)?
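(A minimal sketch of such a check from the Haskell side, assuming the strength byte sits at offset 1 of the key structure, which is what the movzbl 0x1(...) in the disassemblies above suggests; ctxPtr stands for whatever raw pointer you can get at the context, and none of this is an actual cipher-aes API:)

import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- Panic on a corrupted strength byte. Offset 1 is an assumption taken
-- from the movzbl 0x1(%rbp),%eax above; valid strengths should be
-- 0, 1 or 2 (and a Word8 can never be negative).
checkStrength :: Ptr a -> IO ()
checkStrength ctxPtr = do
    s <- peekByteOff ctxPtr 1 :: IO Word8
    if s > 2
        then error ("AES key strength corrupted: " ++ show s)
        else return ()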

TomMD avatar Jan 09 '15 23:01 TomMD

I haven't modified the code yet, but looking at the core dump, I see

=> 0x0000000000b5acae <+78>:    callq  *0x18(%rax,%rcx,8)

and

(gdb) i r ecx
ecx            0xa5     165

(%rax is the jump table; the call is callq *0x18(%rax,%rcx,8), so with %rcx = 0xa5 the indirect call reads an entry 0x18 + 0xa5 * 8 = 1344 bytes into a table that a valid strength of 0, 1, or 2 would only index a few entries into.) This looks to me like the key strength got corrupted.

glasserc avatar Jan 10 '15 00:01 glasserc

Sorry, I think the pointer to the key is probably wrong somehow too:

0x0000000000b5ac96 <+54>:    movzbl 0x1(%rbx),%ecx

and

(gdb) i r rbx
rbx            0x7f9cef083430   140312001918000
(gdb) x/12 140312001918000
0x7f9cef083430: 0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5
0x7f9cef083440: 0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5
0x7f9cef083450: 0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5      0xa5a5a5a5

glasserc avatar Jan 10 '15 00:01 glasserc

Isn't RBX at that point just the aes_block *output? Notice how RBX is being incremented by 16 as well. Don't we want to see RSI (aka RBP)?

TomMD avatar Jan 10 '15 01:01 TomMD

Ugh, it's probably obvious that I haven't studied assembler since college... here are all the registers:

(gdb) info registers
rax            0x1b23f80        28458880
rbx            0x7f9cef083430   140312001918000
rcx            0xa5     165
rdx            0x7fffcfe169d0   140736681044432
rsi            0x7f9cef083430   140312001918000
rdi            0x7f9ced3fe640   140311972013632
rbp            0x7f9ced3fea40   0x7f9ced3fea40
rsp            0x7fffcfe169d0   0x7fffcfe169d0
r8             0x7f9ced3fe640   140311972013632
r9             0x1      1
r10            0x0      0
r11            0x1bb5129        29053225
r12            0x7fffcfe169d0   140736681044432
r13            0x7f9ced3fea60   140311972014688
r14            0x7f9ced3fe650   140311972013648
r15            0x7f9ce92740c0   140311903289536
rip            0xb5acae 0xb5acae <aes_gen_ctr_cont+78>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

glasserc avatar Jan 10 '15 07:01 glasserc

The reason is that the AES context is getting garbage collected (obvious from the 0xa5 in rcx) during, or before, the generation operation. I haven't found an obvious cause so far, and I've already rewritten a couple of things (including the aes_gen_ctr_cont used here) that I thought would fix this issue, with no luck so far.

vincenthz avatar Jan 10 '15 08:01 vincenthz

I've released 0.2.10, marking gen_counter as an unsafe FFI call (so that it cannot be interrupted or raced by the GC). I have no idea at this stage whether that's the main issue here, or what else to do. It seems to only happen on EC2, which makes it freakishly difficult to test (and at the back of my mind I keep thinking it could be a Xen issue...).
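(For anyone following along, the distinction looks like this; these are hypothetical import declarations, not the exact cipher-aes bindings:)

import Data.Word (Word8)
import Foreign.C.Types (CUInt)
import Foreign.Ptr (Ptr)

-- A safe call can be overlapped by GC and other Haskell threads;
-- an unsafe call runs to completion before the RTS can stop that
-- capability for a garbage collection.
foreign import ccall safe   "gen_counter"
    c_gen_counter_safe   :: Ptr Word8 -> CUInt -> IO ()
foreign import ccall unsafe "gen_counter"
    c_gen_counter_unsafe :: Ptr Word8 -> CUInt -> IO ()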

I really don't get how the GC can reach this piece of memory, given that every call that uses the contexts is fenced by touchForeignPtr.
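(The fencing pattern in question, sketched with illustrative names rather than the actual securemem/cipher-aes code:)

import Foreign.ForeignPtr (ForeignPtr, touchForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr)

-- withForeignPtr keeps the ForeignPtr alive for the duration of the
-- action, and the trailing touchForeignPtr is the extra fence meant
-- to stop the GC from finalizing the context before the foreign call
-- is done with the raw pointer.
withContext :: ForeignPtr ctx -> (Ptr ctx -> IO a) -> IO a
withContext fptr action = do
    r <- withForeignPtr fptr action
    touchForeignPtr fptr
    return r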

vincenthz avatar Jan 10 '15 15:01 vincenthz

I spent some time today trying to get a better understanding of this issue, and I didn't find any answers but I thought I'd share my results.

First, I think we can rule out Xen issues. It doesn't just happen on EC2 instances; I can reliably reproduce it on my laptop by pushing to Kinesis (using the overload-kinesis project linked above). It seems to take longer to reproduce when I'm on DSL at home than when I'm on a fat pipe at work.

Second, even with 0.2.10 and recompiling my sandbox, I was still able to reproduce it. Sometimes it seemed like it took longer but that could have been my imagination. (I didn't examine the core dump, so it's possible that I messed something up somewhere or confused myself.)

The next thing I considered was that maybe something about AWS's TLS implementation was provoking a bug somewhere in the system. Specifically, I guessed that maybe Kinesis was using an AES cipher whereas other TLS servers weren't. Running nmap's ssl-enum-ciphers script against the Kinesis server and the Twitter API server, I noticed that the Twitter server offers additional ciphers, and using openssl's s_client command, I saw that the Kinesis endpoint selects the AES128-SHA cipher.

Because I wanted to try to reproduce this issue faster, I tried to set up an nginx instance that would serve SSL, for example using a self-signed cert, in such a way that I could provoke the bug even without AWS. I reasoned that if I could do this, it would make reproducing the issue much easier and improve the odds of someone playing with it until they figured it out. I've been trying to do this all day using v0.2.9, just in case v0.2.10 does actually fix the bug.

I spent some time messing with nginx configs, self-signing a cert according to these directions, and building out the add-https branch on overload-kinesis. With this I was able to push lots and lots of requests over the loopback interface, and as far as I can tell it's using AES128-SHA the same way the Kinesis endpoint is, but no crash.

Since then I've been trying to eliminate other differences between the Kinesis endpoint, which regularly causes the crash, and my local nginx endpoint, which doesn't. I turned off cert verification on the Kinesis code path, which didn't change anything, and I changed my /etc/hosts file to point to my nginx instance for the Kinesis endpoint. (nginx serves a "canned" response that mirrors the form of a real Kinesis response.) So now I can run overload-kinesis against real Kinesis, which causes it to crash, or against myself, which doesn't cause a crash.

Given all of that, I don't know what's going on. It's possible that my cipher selection doesn't match the Kinesis endpoint, or that there's some subtlety to TLS that I'm not capturing. (Here is the openssl s_client output for AWS vs. me.)

glasserc avatar Jan 11 '15 02:01 glasserc

Note that the thing that consistently fails is the CTR generation, which is used exclusively by random-number generation through cprng-aes. The TLS protocol itself doesn't use any CTR-mode encryption.
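(For context, here is CTR keystream generation in sketch form; encryptBlock is an illustrative stand-in, not the cipher-aes API. This is the loop visible in the disassemblies above: one block-encrypt call per 16-byte block, with a big-endian counter increment in between, which is what the bswap/add/bswap sequence implements:)

import qualified Data.ByteString as B

ctrKeystream :: (B.ByteString -> B.ByteString)  -- block encryption
             -> B.ByteString                    -- initial 16-byte counter
             -> Int                             -- number of blocks
             -> B.ByteString
ctrKeystream encryptBlock iv nblocks =
    B.concat (take nblocks (map encryptBlock (iterate bumpBE iv)))
  where
    -- big-endian increment of the counter, with carry
    bumpBE = B.pack . reverse . carry . reverse . B.unpack
    carry []        = []
    carry (x:xs)
        | x == 0xff = 0 : carry xs
        | otherwise = (x + 1) : xs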

vincenthz avatar Jan 11 '15 06:01 vincenthz

Hmm, and I guess that only happens during TLS handshakes? Maybe some interaction between that and the AES cipher stuff? Or the GC and the TLS handshakes?

I was sharing an HTTP Manager, which might have kept the connection alive when talking over loopback. I changed the code to use a new manager on each iteration of each thread, but no crash. I verified using the profiler that the tlsEstablish method gets called 38k times. In fact, here are the profiles and GC stats. local is just spamming my nginx instance. real is talking to Kinesis. I let both run about 10 minutes or so of wall-clock time (I couldn't extract a profile if I was going to let the Kinesis one crash).
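(The change was essentially this, sketched against http-client's API:)

import Control.Monad (forever)
import Network.HTTP.Client (httpLbs, newManager, parseUrlThrow)
import Network.HTTP.Client.TLS (tlsManagerSettings)

-- A fresh Manager per iteration means no connection reuse: every
-- request pays for a new TCP connection and a full TLS handshake.
worker :: String -> IO ()
worker url = forever $ do
    mgr <- newManager tlsManagerSettings
    req <- parseUrlThrow url
    _   <- httpLbs req mgr
    return ()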

I guess the thing that most likely got GCed was the RNG, and that this happened just before starting a TLS session? That seems extremely odd, especially since I can't reproduce it when using HTTPS locally. I found bug reports online that suggest using touchForeignPtr, which @vincenthz says he's doing (although I don't see which project that's happening in.. hs-tls? hs-connection?).

Still grasping at straws..

glasserc avatar Jan 11 '15 20:01 glasserc

It's done automatically when using withForeignPtr, in securemem. securemem's finalizer is the one poisoning the memory with the value 0xa5.
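(Sketched with illustrative names; see securemem for the real implementation. The finalizer scrubs the memory before freeing it, which is why a use-after-finalize shows up as the telltale 0xa5a5... pattern in the dump above:)

import Data.Word (Word8)
import Foreign.Concurrent (newForeignPtr)
import Foreign.ForeignPtr (ForeignPtr)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Marshal.Utils (fillBytes)

-- Allocate n bytes whose finalizer poisons the memory with 0xa5
-- before freeing it.
allocScrubbed :: Int -> IO (ForeignPtr Word8)
allocScrubbed n = do
    p <- mallocBytes n
    newForeignPtr p (fillBytes p 0xa5 n >> free p)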

vincenthz avatar Jan 11 '15 22:01 vincenthz

This is really way deeper than anything I've ever done in Haskell, and I'm learning a lot about GC and about TLS, so thanks for continuing to explain things to me :)

I think the lifetime of the Manager might have something to do with it. This commit ran successfully for four hours. Without this commit, even using one thread, I was able to induce a crash on my laptop in 6 minutes. That's not conclusive, of course, but does this maybe suggest other angles to try? Are there techniques I can use to trace finalization of arbitrary objects? I'm happy to do the legwork here if I'm the only one who's reliably getting the segfault.

glasserc avatar Jan 12 '15 01:01 glasserc

The most straightforward link is that after a while cprng-aes re-creates a new AES context, and the previous one is then eligible for GC. That should happen after about 1 MB of randomness has been generated.
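(Roughly this shape, with made-up names; this is not the actual cprng-aes source:)

import qualified Data.ByteString as B

data Ctx = Ctx                      -- hypothetical AES context
newCtx :: Ctx -> Ctx                -- hypothetical reseed
newCtx _ = Ctx
runCtr :: Ctx -> Int -> B.ByteString
runCtr _ n = B.replicate n 0        -- placeholder for CTR keystream

reseedThreshold :: Int
reseedThreshold = 1024 * 1024       -- ~1 MB, per the comment above

-- After reseedThreshold bytes of output a fresh context replaces the
-- old one, which then becomes garbage: eligible for GC and hence for
-- securemem's poisoning finalizer.
genBytes :: Int -> (Ctx, Int) -> (B.ByteString, (Ctx, Int))
genBytes n (ctx, produced)
    | produced + n > reseedThreshold = (runCtr ctx' n, (ctx', n))
    | otherwise                      = (runCtr ctx n, (ctx, produced + n))
  where
    ctx' = newCtx ctx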

Thanks for diving into it @glasserc

vincenthz avatar Jan 12 '15 17:01 vincenthz

If that is strongly related then one experiment would be to lower the bound from 1MB to kilobytes and see if the test fails any faster...

TomMD avatar Jan 13 '15 01:01 TomMD

Hmm. Well, I did try changing object lifetimes and stuff around, and even changing the aesrngThreshold and recompiling, but still nothing. I did get one segmentation fault but the backtrace pointed to something starting with stg_, so I decided that was probably an artifact of reinstalling so many packages in my cabal sandbox, and when I recompiled everything, that seg fault went away. I can confirm, though, that even with 0.2.10 in a clean sandbox, I can still reproduce a seg fault. (Even without -threaded in my main application's compiler flags.)

I tried reducing the number of blocks to 1 (so only 1k of randomness generated per context) and adding a trace message every time a new context was generated. Even the kinesis "backend" still seems to generate this message quite a few times, even though it only ever speaks to one IP address. It just crashed after 21 trace messages.

It seems to generate a new context about every 8 Kinesis POST events, as well as when it opens a new connection. It just crashed 8 POST events after the previous context generation, just after claiming that it was going to open another connection to the same hostname.

At least I managed to figure out what's calling the RNG -- it's not just as part of TLS handshakes, but every time prepareRecord gets called. I'm not sure what that means either.

One other thing I figured out was that nginx was only allowing connections to be kept alive for 100 requests because of the default value of keepalive_requests. It seems not to respect a value of a million requests, but it does respect 10,000. Still no crash. I don't think Kinesis is leaving connections open that long anyhow -- it seems to close them after fewer than 100 requests.

Unfortunately, this is probably the limit of how much time I can spend trying to get to the bottom of this. I'm doing this mostly on work time and my manager points out that we're OK restarting the process all the time if that's what's necessary. Hope this all helps someone else figure out what's going on..

glasserc avatar Jan 20 '15 03:01 glasserc

I've been investigating too, and for lack of an obvious culprit, I've removed the whole code path and replaced it with something else. Try updating connection to 0.2.4; that should solve your problem. Some production code has been running on it all weekend, and it seems solid.

vincenthz avatar Jan 20 '15 04:01 vincenthz

Yeah, in my tests this seems solid. It just ran for 3.5 hours without crashing. Any idea why replacing the calls to AES in connection would fix this?

glasserc avatar Jan 21 '15 01:01 glasserc