hs-cipher-aes
segfault with AES-NI
I'm creating a small utility which uploads a number of files to S3 using the aws package, and found that it crashed with a segfault after making a lot of HTTP requests. I took a look at the stack trace and figured out that it crashed in aes_gen_ctr.
(gdb) where
#0 0x000000000096495e in aes_gen_ctr ()
#1 0x000000000095c609 in c7ZG_info ()
#2 0x0000000000000000 in ?? ()
Then, when I turned off AES-NI by editing cipher-aes.cabal, rebuilt the whole program, and ran it again, it worked fine.
I'm not sure whether this is related to cipher-aes or not; actually I thought it might be something to do with GHC itself, but I don't have any idea how to investigate further. Do you have any ideas on this?
I built and ran it with ghc-7.6.3 and ghc-7.8.2 on Ubuntu 12.04 LTS running on an EC2 instance which supports AES-NI.
To be clear, your EC2 instance shows aes in /proc/cpuinfo? Does gdb show the segfault occurring at an AES-NI instruction?
Yes, it shows aes. Here is an excerpt from /proc/cpuinfo.
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc up rep_good nopl nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat epb xsaveopt pln pts dtherm
I don't recall which instruction gdb showed on the segfault. I accidentally overwrote the binary that caused the segfault, so I can't check it now. Let me try reproducing the segfault.
@snakamura next time you do a stack trace, can you also do a 'disassemble' of the offending function? It would be a great help for debugging.
For now, I'm not sure what the reason for this segfault is. Is the EC2 instance perhaps running on an AMD CPU? I haven't tested the AES-NI instructions there, for example, so that could explain this. However, something that is not consistently reproducible might mean something is quite fishy.
Here is the disassembly at the segfault. It looks like it has nothing to do with AES-NI.
+---------------------------------------------------------------------------+
|0x96395d <aes_gen_ctr+13> mov %rsi,%rbp |
|0x963960 <aes_gen_ctr+16> push %rbx |
|0x963961 <aes_gen_ctr+17> mov %rdi,%rbx |
|0x963964 <aes_gen_ctr+20> sub $0x10,%rsp |
|0x963968 <aes_gen_ctr+24> mov (%rdx),%rax |
|0x96396b <aes_gen_ctr+27> mov %rax,(%rsp) |
|0x96396f <aes_gen_ctr+31> mov 0x8(%rdx),%rax |
|0x963973 <aes_gen_ctr+35> mov %rax,0x8(%rsp) |
|0x963978 <aes_gen_ctr+40> cmp %r12,%rbx |
|0x96397b <aes_gen_ctr+43> je 0x9639b0 <aes_gen_ctr+96> |
|0x96397d <aes_gen_ctr+45> movzbl 0x1(%rbp),%eax |
|0x963981 <aes_gen_ctr+49> mov %rbx,%rdi |
|0x963984 <aes_gen_ctr+52> mov %rsp,%rdx |
|0x963987 <aes_gen_ctr+55> mov %rbp,%rsi |
|0x96398a <aes_gen_ctr+58> add $0x10,%rbx |
>|0x96398e <aes_gen_ctr+62> callq *0x18cb318(,%rax,8) |
|0x963995 <aes_gen_ctr+69> mov 0x8(%rsp),%rax |
|0x96399a <aes_gen_ctr+74> bswap %rax |
|0x96399d <aes_gen_ctr+77> add $0x1,%rax |
|0x9639a1 <aes_gen_ctr+81> je 0x9639c0 <aes_gen_ctr+112> |
|0x9639a3 <aes_gen_ctr+83> bswap %rax |
|0x9639a6 <aes_gen_ctr+86> cmp %r12,%rbx |
|0x9639a9 <aes_gen_ctr+89> mov %rax,0x8(%rsp) |
|0x9639ae <aes_gen_ctr+94> jne 0x96397d <aes_gen_ctr+45> |
|0x9639b0 <aes_gen_ctr+96> add $0x10,%rsp |
|0x9639b4 <aes_gen_ctr+100> pop %rbx |
|0x9639b5 <aes_gen_ctr+101> pop %rbp |
|0x9639b6 <aes_gen_ctr+102> pop %r12 |
|0x9639b8 <aes_gen_ctr+104> retq |
|0x9639b9 <aes_gen_ctr+105> nopl 0x0(%rax) |
|0x9639c0 <aes_gen_ctr+112> mov (%rsp),%rax |
|0x9639c4 <aes_gen_ctr+116> movq $0x0,0x8(%rsp) |
|0x9639cd <aes_gen_ctr+125> bswap %rax |
|0x9639d0 <aes_gen_ctr+128> add $0x1,%rax |
|0x9639d4 <aes_gen_ctr+132> bswap %rax |
+---------------------------------------------------------------------------+
multi-thre Thread 0x7fa2d In: aes_gen_ctr Line: ?? PC: 0x96398e
And here is the output of cat /proc/cpuinfo.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
microcode : 0x70a
cpu MHz : 1799.999
cache size : 20480 KB
physical id : 1
siblings : 1
core id : 1
cpu cores : 1
apicid : 34
initial apicid : 34
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc up rep_good nopl nonstop_tsc pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm arat epb xsaveopt pln pts dtherm
bogomips : 3599.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
It appears as if the branch table has been corrupted or badly initialized. Not sure what would cause this. Do you have the segfault information from dmesg, which looks something like:
segfault at 11 ip 00007f24b21adcee sp 00007fff7379ded0 error 4 in lib.so[7f24b197a000+f6f000]
Yes, it says:
[33773307.954570] storefiles[20603] general protection ip:96398e sp:7fa2d8cfade0 error:0 in storefiles[400000+1228000]
Just wanted to say that we have also just run into this problem in prod when making lots and lots of AWS-related HTTPS requests per second. Our gdb output is very similar; I am pasting the information below:
Backtrace:
(gdb) bt
#0 0x0000000003864c83 in aes_gen_ctr_cont ()
#1 0x000000000385d529 in ?? ()
#2 0x0000000000000000 in ?? ()
Disassemble:
(gdb) disassemble
Dump of assembler code for function aes_gen_ctr_cont:
0x0000000003864c40 <+0>: push %r13
0x0000000003864c42 <+2>: mov %ecx,%r13d
0x0000000003864c45 <+5>: shl $0x4,%r13
0x0000000003864c49 <+9>: push %r12
0x0000000003864c4b <+11>: add %rdi,%r13
0x0000000003864c4e <+14>: mov %rdx,%r12
0x0000000003864c51 <+17>: push %rbp
0x0000000003864c52 <+18>: mov %rsi,%rbp
0x0000000003864c55 <+21>: push %rbx
0x0000000003864c56 <+22>: mov %rdi,%rbx
0x0000000003864c59 <+25>: sub $0x18,%rsp
0x0000000003864c5d <+29>: mov (%rdx),%rax
0x0000000003864c60 <+32>: mov %rax,(%rsp)
0x0000000003864c64 <+36>: mov 0x8(%rdx),%rax
0x0000000003864c68 <+40>: mov %rax,0x8(%rsp)
0x0000000003864c6d <+45>: cmp %r13,%rbx
0x0000000003864c70 <+48>: je 0x3864ca5 <aes_gen_ctr_cont+101>
0x0000000003864c72 <+50>: movzbl 0x1(%rbp),%eax
0x0000000003864c76 <+54>: mov %rbx,%rdi
0x0000000003864c79 <+57>: mov %rsp,%rdx
0x0000000003864c7c <+60>: mov %rbp,%rsi
0x0000000003864c7f <+63>: add $0x10,%rbx
=> 0x0000000003864c83 <+67>: callq *0x5d6e678(,%rax,8)
0x0000000003864c8a <+74>: mov 0x8(%rsp),%rax
0x0000000003864c8f <+79>: bswap %rax
0x0000000003864c92 <+82>: add $0x1,%rax
0x0000000003864c96 <+86>: je 0x3864cc8 <aes_gen_ctr_cont+136>
0x0000000003864c98 <+88>: bswap %rax
0x0000000003864c9b <+91>: cmp %r13,%rbx
0x0000000003864c9e <+94>: mov %rax,0x8(%rsp)
0x0000000003864ca3 <+99>: jne 0x3864c72 <aes_gen_ctr_cont+50>
0x0000000003864ca5 <+101>: mov (%rsp),%rax
0x0000000003864ca9 <+105>: mov %rax,(%r12)
0x0000000003864cad <+109>: mov 0x8(%rsp),%rax
0x0000000003864cb2 <+114>: mov %rax,0x8(%r12)
0x0000000003864cb7 <+119>: add $0x18,%rsp
0x0000000003864cbb <+123>: pop %rbx
0x0000000003864cbc <+124>: pop %rbp
0x0000000003864cbd <+125>: pop %r12
0x0000000003864cbf <+127>: pop %r13
0x0000000003864cc1 <+129>: retq
0x0000000003864cc2 <+130>: nopw 0x0(%rax,%rax,1)
0x0000000003864cc8 <+136>: mov (%rsp),%rax
0x0000000003864ccc <+140>: movq $0x0,0x8(%rsp)
0x0000000003864cd5 <+149>: bswap %rax
0x0000000003864cd8 <+152>: add $0x1,%rax
0x0000000003864cdc <+156>: bswap %rax
0x0000000003864cdf <+159>: mov %rax,(%rsp)
0x0000000003864ce3 <+163>: jmp 0x3864c6d <aes_gen_ctr_cont+45>
End of assembler dump.
cpuinfo:
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping : 4
microcode : 0x416
cpu MHz : 2800.050
cache size : 25600 KB
physical id : 1
siblings : 8
core id : 1
cpu cores : 1
apicid : 35
initial apicid : 35
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase erms
bogomips : 5600.10
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping : 4
microcode : 0x416
cpu MHz : 2800.050
cache size : 25600 KB
physical id : 1
siblings : 8
core id : 1
cpu cores : 1
apicid : 35
initial apicid : 35
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase erms
bogomips : 5600.10
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping : 4
microcode : 0x416
cpu MHz : 2800.050
cache size : 25600 KB
physical id : 1
siblings : 8
core id : 1
cpu cores : 1
apicid : 35
initial apicid : 35
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase erms
bogomips : 5600.10
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
We also hit this in production and were able to reproduce it on our development machines, with a similar backtrace:
(gdb) bt
#0 0x0000000000fdc093 in aes_gen_ctr_cont ()
#1 0x0000000000fd4e2b in ?? ()
#2 0x00007f515fdd0ca0 in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb) disassemble
Dump of assembler code for function aes_gen_ctr_cont:
0x0000000000fdc050 <+0>: push %r13
0x0000000000fdc052 <+2>: mov %ecx,%r13d
0x0000000000fdc055 <+5>: shl $0x4,%r13
0x0000000000fdc059 <+9>: push %r12
0x0000000000fdc05b <+11>: add %rdi,%r13
0x0000000000fdc05e <+14>: mov %rdx,%r12
0x0000000000fdc061 <+17>: push %rbp
0x0000000000fdc062 <+18>: mov %rsi,%rbp
0x0000000000fdc065 <+21>: push %rbx
0x0000000000fdc066 <+22>: mov %rdi,%rbx
0x0000000000fdc069 <+25>: sub $0x18,%rsp
0x0000000000fdc06d <+29>: mov (%rdx),%rax
0x0000000000fdc070 <+32>: mov %rax,(%rsp)
0x0000000000fdc074 <+36>: mov 0x8(%rdx),%rax
0x0000000000fdc078 <+40>: mov %rax,0x8(%rsp)
0x0000000000fdc07d <+45>: cmp %r13,%rbx
0x0000000000fdc080 <+48>: je 0xfdc0b5 <aes_gen_ctr_cont+101>
0x0000000000fdc082 <+50>: movzbl 0x1(%rbp),%eax
0x0000000000fdc086 <+54>: mov %rbx,%rdi
0x0000000000fdc089 <+57>: mov %rsp,%rdx
0x0000000000fdc08c <+60>: mov %rbp,%rsi
0x0000000000fdc08f <+63>: add $0x10,%rbx
=> 0x0000000000fdc093 <+67>: callq *0x232f4d8(,%rax,8)
0x0000000000fdc09a <+74>: mov 0x8(%rsp),%rax
0x0000000000fdc09f <+79>: bswap %rax
0x0000000000fdc0a2 <+82>: add $0x1,%rax
0x0000000000fdc0a6 <+86>: je 0xfdc0d8 <aes_gen_ctr_cont+136>
0x0000000000fdc0a8 <+88>: bswap %rax
0x0000000000fdc0ab <+91>: cmp %r13,%rbx
0x0000000000fdc0ae <+94>: mov %rax,0x8(%rsp)
0x0000000000fdc0b3 <+99>: jne 0xfdc082 <aes_gen_ctr_cont+50>
0x0000000000fdc0b5 <+101>: mov (%rsp),%rax
0x0000000000fdc0b9 <+105>: mov %rax,(%r12)
0x0000000000fdc0bd <+109>: mov 0x8(%rsp),%rax
0x0000000000fdc0c2 <+114>: mov %rax,0x8(%r12)
0x0000000000fdc0c7 <+119>: add $0x18,%rsp
0x0000000000fdc0cb <+123>: pop %rbx
0x0000000000fdc0cc <+124>: pop %rbp
0x0000000000fdc0cd <+125>: pop %r12
0x0000000000fdc0cf <+127>: pop %r13
0x0000000000fdc0d1 <+129>: retq
0x0000000000fdc0d2 <+130>: nopw 0x0(%rax,%rax,1)
0x0000000000fdc0d8 <+136>: mov (%rsp),%rax
0x0000000000fdc0dc <+140>: movq $0x0,0x8(%rsp)
0x0000000000fdc0e5 <+149>: bswap %rax
0x0000000000fdc0e8 <+152>: add $0x1,%rax
0x0000000000fdc0ec <+156>: bswap %rax
0x0000000000fdc0ef <+159>: mov %rax,(%rsp)
0x0000000000fdc0f3 <+163>: jmp 0xfdc07d <aes_gen_ctr_cont+45>
It's also not predictably reproducible. I reproduced it on my laptop yesterday but wasn't able to reproduce it when running overnight. In prod, it seems like we hit it on any given machine at least every two hours. It seems to happen when pushing records to Amazon Kinesis.
Is there anything we can do to try to narrow down this problem or help resolve it?
Based on the clues I got from this thread, I wrote a program that just hammers Kinesis with a ton of threads. It crashed in 22 seconds, and I ran it again and it crashed in 5 minutes. This is much more reproducible than anything else I've done so far with this problem. Anyhow, the repository is at https://github.com/SumAll/haskell-overload-kinesis if anyone else wants to try to reproduce it.
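For anyone who wants to approximate the setup without cloning that repo, here is a minimal sketch of such a stress harness -- not the overload-kinesis code itself, and assuming the http-client / http-client-tls APIs. The endpoint URL is a placeholder and the requests are unsigned, so only the volume of TLS handshakes and requests matters, not the responses:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)
import Control.Monad (replicateM, replicateM_)
import qualified Data.ByteString.Lazy as L
import Network.HTTP.Client (Response, httpLbs, newManager, parseRequest)
import Network.HTTP.Client.TLS (tlsManagerSettings)

main :: IO ()
main = do
  mgr <- newManager tlsManagerSettings
  -- Placeholder endpoint: a real test would sign and POST Kinesis records.
  req <- parseRequest "https://kinesis.us-east-1.amazonaws.com/"
  dones <- replicateM 50 newEmptyMVar              -- 50 worker threads
  mapM_ (\done -> forkIO $ do
            replicateM_ 10000 $ do                 -- lots of HTTPS round-trips
              _ <- try (httpLbs req mgr)
                     :: IO (Either SomeException (Response L.ByteString))
              return ()
            putMVar done ()) dones
  mapM_ takeMVar dones                             -- wait for all workers

All workers share one Manager, mirroring the description above of many threads hammering a single endpoint.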
I'm curious if the jump table is being corrupted or if the 'strength' field of the key structure is actually being altered somehow. Can anyone capable of running this test add in a check of the strength and print/panic on any erroneous values (> 2 || < 0)?
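A minimal sketch of such a check, assuming the context is reachable on the Haskell side as a ForeignPtr Word8 and that the strength byte sits at offset 1 (as the movzbl 0x1(...) in the dumps suggests); checkStrength is a hypothetical helper, not part of the package:

import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Storable (peekByteOff)

-- Sketch only: read the presumed strength byte and panic loudly on an
-- out-of-range value instead of letting it index the dispatch table.
-- (The byte is unsigned, so only the upper bound needs checking.)
checkStrength :: ForeignPtr Word8 -> IO ()
checkStrength ctxFPtr = withForeignPtr ctxFPtr $ \ctxPtr -> do
  strength <- peekByteOff ctxPtr 1 :: IO Word8
  if strength > 2
    then error ("corrupted AES context, strength byte = " ++ show strength)
    else return ()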
I haven't modified the code yet, but looking at the core dump, I see
=> 0x0000000000b5acae <+78>: callq *0x18(%rax,%rcx,8)
and
(gdb) i r ecx
ecx 0xa5 165
(%rax is the jump table.) This seems to me like the key strength likely got corrupted.
Sorry, I think the pointer to the key is probably wrong somehow too:
0x0000000000b5ac96 <+54>: movzbl 0x1(%rbx),%ecx
and
(gdb) i r rbx
rbx 0x7f9cef083430 140312001918000
(gdb) x/12 140312001918000
0x7f9cef083430: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
0x7f9cef083440: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
0x7f9cef083450: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
Isn't RBX at that point just the aes_block *output? Notice how RBX is being incremented by 16 as well. Don't we want to see RSI (aka RBP)?
Ugh, it's probably obvious that I haven't studied assembler since college... here are all the registers:
(gdb) info registers
rax 0x1b23f80 28458880
rbx 0x7f9cef083430 140312001918000
rcx 0xa5 165
rdx 0x7fffcfe169d0 140736681044432
rsi 0x7f9cef083430 140312001918000
rdi 0x7f9ced3fe640 140311972013632
rbp 0x7f9ced3fea40 0x7f9ced3fea40
rsp 0x7fffcfe169d0 0x7fffcfe169d0
r8 0x7f9ced3fe640 140311972013632
r9 0x1 1
r10 0x0 0
r11 0x1bb5129 29053225
r12 0x7fffcfe169d0 140736681044432
r13 0x7f9ced3fea60 140311972014688
r14 0x7f9ced3fe650 140311972013648
r15 0x7f9ce92740c0 140311903289536
rip 0xb5acae 0xb5acae <aes_gen_ctr_cont+78>
eflags 0x10206 [ PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
The reason is that the AES context is getting garbage collected (obvious from the 0xa5 in rcx) during (or before) the generation operation. I haven't found any obvious cause so far, and I've already rewritten a couple of things (including the aes_gen_ctr_cont used here) that I thought would fix this issue, without any luck so far.
I've released 0.2.10, marking gen_counter as unsafe (so that it cannot be interrupted, or raced by the GC). I have no idea at this stage whether that's the main issue here, or what else to do. It only seems to happen on EC2, which makes it freakishly difficult to test (and at the back of my mind I keep thinking it could be a Xen issue...).
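For readers unfamiliar with the distinction, here is a sketch of what marking the call unsafe buys. The C prototype below is illustrative only (output, key, iv, block count), not cipher-aes's exact signature:

{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Word (Word32, Word8)
import Foreign.Ptr (Ptr)

-- A "safe" call lets the RTS run other Haskell threads -- and the garbage
-- collector -- while the C code executes.
foreign import ccall safe "aes_gen_ctr"
  c_aes_gen_ctr_safe :: Ptr Word8 -> Ptr Word8 -> Ptr Word8 -> Word32 -> IO ()

-- An "unsafe" call runs to completion on the calling capability, so a
-- stop-the-world GC (and any finalizers) cannot fire in the middle of it.
foreign import ccall unsafe "aes_gen_ctr"
  c_aes_gen_ctr_unsafe :: Ptr Word8 -> Ptr Word8 -> Ptr Word8 -> Word32 -> IO ()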
I really don't get how the GC can get to this piece of memory, given that every call that uses the contexts is fenced by touchForeignPtr after the call.
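For reference, a minimal sketch of the fencing pattern being described -- an assumed shape, not the package's actual code. Once a raw Ptr has escaped, only a later touchForeignPtr keeps the GC from finalizing (and poisoning) the context while the pointer is still in use:

import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, touchForeignPtr)
import Foreign.ForeignPtr.Unsafe (unsafeForeignPtrToPtr)
import Foreign.Ptr (Ptr)

-- Sketch only: call an action with the raw context pointer, then touch the
-- ForeignPtr so the GC cannot decide the context is dead (and run its
-- finalizer) before the action has finished with the pointer.
callWithContext :: ForeignPtr Word8 -> (Ptr Word8 -> IO a) -> IO a
callWithContext ctxFPtr action = do
  r <- action (unsafeForeignPtrToPtr ctxFPtr)
  touchForeignPtr ctxFPtr   -- fence: keeps ctxFPtr reachable until here
  return r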
I spent some time today trying to get a better understanding of this issue, and I didn't find any answers but I thought I'd share my results.
First, I think we can rule out Xen issues. It doesn't just happen on EC2 instances; I can reliably reproduce it on my laptop by pushing to Kinesis (using the overload-kinesis project, linked above). It seems to take longer to reproduce when I'm on DSL at home than when I'm on a fat pipe at work.
Second, even with 0.2.10 and recompiling my sandbox, I was still able to reproduce it. Sometimes it seemed like it took longer but that could have been my imagination. (I didn't examine the core dump, so it's possible that I messed something up somewhere or confused myself.)
The next thing I considered was that maybe something about AWS's implementation of TLS was provoking a bug somewhere in the system. Specifically, I guessed that maybe Kinesis was using an AES cipher whereas other TLS servers weren't. Running nmap's ssl-enum-ciphers script against the Kinesis server and the Twitter API server, the Twitter server offers additional ciphers, and if I use openssl's s_client command, the Kinesis endpoint selects the AES128-SHA cipher.
Because I wanted to try to reproduce this issue faster, I tried to set up an nginx instance that would serve SSL, for example using a self-signed cert, in such a way that I could provoke the bug even without AWS. I reasoned that if I could do this, it would make reproducing the issue much easier and improve the odds of someone playing with it until they figured it out. I've been trying to do this all day using v0.2.9, just in case v0.2.10 does actually fix the bug.
I spent some time messing with nginx configs, self-signing a cert according to these directions, and building out the add-https branch on overload-kinesis. With this I was able to push lots and lots of requests over the loopback interface, and as far as I can tell it's using AES128-SHA the same way the Kinesis endpoint is, but no crash.
Since then I've been trying to eliminate other differences between the Kinesis endpoint, which regularly causes the crash, and my local nginx endpoint, which doesn't. I turned off cert verification on the Kinesis code path, which didn't change anything, and I changed my /etc/hosts file to point to my nginx instance for the Kinesis endpoint. (nginx serves a "canned" response that mirrors the form of a real Kinesis response.) So now I can run overload-kinesis against real Kinesis, which causes it to crash, or against myself, which doesn't cause a crash.
Given all of that, I don't know what's going on. It's possible that my cipher selection doesn't match the Kinesis endpoint, or that there's some subtlety to TLS that I'm not capturing. (Here is the openssl s_client output for AWS vs. me.)
Note that the thing that consistently fails is the CTR generation, which is exclusively used by the random generation through cprng-aes. The TLS protocol doesn't use any CTR-mode encryption.
Hmm, and I guess that only happens during TLS handshakes? Maybe some interaction between that and the AES cipher stuff? Or the GC and the TLS handshakes?
I was sharing an HTTP Manager, which might have kept the connection alive when talking over loopback. I changed the code to use a new manager on each iteration of each thread, but no crash. I verified using the profiler that the tlsEstablish method gets called 38k times. In fact, here are the profiles and GC stats. local is just spamming my nginx instance; real is talking to Kinesis. I let both run about 10 minutes or so of wall-clock time (I couldn't extract a profile if I was going to let the Kinesis one crash).
I guess the thing that most likely got GCed was the RNG, and that this happened just before starting a TLS session? That seems extremely odd, especially since I can't reproduce it when using HTTPS locally. I found bug reports online that suggest using touchForeignPtr, which @vincenthz says he's doing (although I don't see which project that's happening in.. hs-tls? hs-connection?).
Still grasping at straws..
It's automatically done when using withForeignPtr, in securemem. securemem's finalizer is the one doing the poisoning to the value 0xa5.
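To make the 0xa5 pattern concrete, here is a sketch of the poisoning idea -- an assumption about the general shape, not securemem's actual implementation:

import Data.Word (Word8)
import qualified Foreign.Concurrent as FC
import Foreign.ForeignPtr (ForeignPtr)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Marshal.Utils (fillBytes)

-- Sketch only: allocate a buffer whose finalizer overwrites it with 0xa5
-- before freeing, so any use-after-finalize shows up as the 0xa5a5a5a5...
-- pattern seen in the gdb dumps above.
newPoisoningBuffer :: Int -> IO (ForeignPtr Word8)
newPoisoningBuffer sz = do
  p <- mallocBytes sz
  FC.newForeignPtr p (fillBytes p 0xa5 sz >> free p)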
This is really way deeper than anything I've ever done in Haskell, and I'm learning a lot about GC and about TLS, so thanks for continuing to explain things to me :)
I think the lifetime of the Manager might have something to do with it. This commit ran successfully for four hours. Without this commit, even using one thread, I was able to induce a crash on my laptop in 6 minutes. That's not conclusive, of course, but does this maybe suggest other angles to try? Are there techniques I can use to trace finalization of arbitrary objects? I'm happy to do the legwork here if I'm the only one who's reliably getting the segfault.
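For comparison, a sketch of one Manager-lifetime variation described earlier in the thread (a fresh Manager per iteration) -- hypothetical code, not the linked commit:

import Control.Monad (forever, void)
import Network.HTTP.Client (httpLbs, newManager, parseRequest)
import Network.HTTP.Client.TLS (tlsManagerSettings)

-- Sketch only: one worker loop that rebuilds the Manager each time instead
-- of sharing a long-lived one across threads and iterations, so no TLS
-- connection outlives a single request.
workerLoop :: String -> IO ()
workerLoop url = forever $ do
  mgr <- newManager tlsManagerSettings   -- fresh Manager per iteration
  req <- parseRequest url
  void (httpLbs req mgr)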
The most straightforward link is that after a while cprng-aes will re-create a new AES context and thus the previous one will be GCed; that should happen after 1 MB of randomness has been generated.
Thanks for diving into it @glasserc
If that is strongly related then one experiment would be to lower the bound from 1MB to kilobytes and see if the test fails any faster...
Hmm. Well, I did try changing object lifetimes and stuff around, and even changing the aesrngThreshold and recompiling, but still nothing. I did get one segmentation fault but the backtrace pointed to something starting with stg_, so I decided that was probably an artifact of reinstalling so many packages in my cabal sandbox, and when I recompiled everything, that segfault went away. I can confirm, though, that even with 0.2.10 in a clean sandbox, I can still reproduce a segfault. (Even without -threaded in my main application's compiler flags.)
I tried reducing the number of blocks to 1 (so only 1k of randomness generated per context) and adding a trace message every time a new context was generated. Even the kinesis "backend" still seems to generate this message quite a few times, even though it only ever speaks to one IP address. It just crashed after 21 trace messages.
It seems to generate a new context about every 8 Kinesis POST events, as well as when it opens a new connection. It just crashed 8 POST events after the previous context generation, just after claiming that it was going to open another connection to the same hostname.
At least I managed to figure out what's calling the RNG -- it's not just as part of TLS handshakes, but every time prepareRecord gets called. I'm not sure what that means either.
One other thing I figured out was that nginx was only allowing connections to be kept alive for 100 requests because of the default value for keepalive_requests. It seems not to respect a million requests, but it does respect 10,000. Still no crash. I don't think Kinesis is leaving connections open that long anyhow -- it seems to close them after less than 100.
Unfortunately, this is probably the limit of how much time I can spend trying to get to the bottom of this. I'm doing this mostly on work time and my manager points out that we're OK restarting the process all the time if that's what's necessary. Hope this all helps someone else figure out what's going on..
I've been investigating too, and for lack of an obvious culprit, I've removed the whole code path and replaced it with something else. Try updating connection to 0.2.4; that should solve your problem. Some production code has been running on it the whole weekend, and it seems solid.
Yeah, in my tests this seems solid. It just ran for 3.5 hours without crashing. Any idea why replacing the calls to AES in connection would fix this?