
Performance Gap Between Logical Core and Physical Core

qijiax opened this issue 3 years ago

Description of the problem

We used HammerDB to benchmark the performance of MariaDB in Gramine. We tested on a 2-socket platform with 56 physical cores (112 logical cores) per socket. The number of HammerDB virtual users (VU) is set to 16. The Gramine-MariaDB service is bound to one socket. With no limitation on cores, the HammerDB result is 270k NOPM and the CPU utilization is 1600%. If we limit the cores to 8 physical cores / 16 logical cores (8C16T), the result drops to only 70k NOPM, which is much worse, while the CPU utilization still approaches 1600%. As a comparison, MariaDB without Gramine achieves 360k NOPM with no core limitation and 230k NOPM with 8C16T.

I recorded the perf info; it shows that libos_syscall_futex consumed a large share of CPU time in the 8C16T scenario.


   - 70.89% libos_syscall_entry                                                                                                                                                         
      - 70.45% libos_emulate_syscall                                                                                                                                                    
         - 63.01% libos_syscall_futex                                                                                                                                                   
            - 63.01% _libos_syscall_futex (inlined)                                                                                                                                     
               - 49.28% futex_wait (inlined)                                                                                                                                           
                    35.96% spinlock_lock (inlined)                                                                                                                                      
                  + 7.46% free                                                                                                                                                          
                  + 5.19% create_new_futex                                                                                                                                              
               - 13.71% futex_wake                                                                                                                                                      
                    13.15% spinlock_lock (inlined)                                                                                                                                      
         + 2.83% libos_syscall_clock_gettime                                                                                                                                            
         + 1.57% libos_syscall_poll                                                                                                                                                     
         + 0.83% libos_syscall_write                                                                                                                                                    
           0.58% libos_syscall_pwrite64        

Steps to reproduce

The configuration is the same as in https://github.com/gramineproject/gramine/issues/853

Gramine is on this branch: https://github.com/gramineproject/gramine/tree/borys/test_rm_pal_iov

The MariaDB service is bound to socket 0:

 numactl --cpubind=0 --membind=0 gramine-direct  mariadbd --user=root

HammerDB config: warehouses: 200, VU: 16, with an SSL connection

Expected results

8C16T should reach about 75% of the performance with no core limitation.

qijiax avatar Sep 14 '22 09:09 qijiax

Yes, we are aware that heavy multi-core scenarios which use a lot of futexes currently have some performance bottlenecks around accessing the shared AVL tree in the futex wake/wait code (basically high lock contention). I started working on it some time ago, but it required some time and had low priority, so the work stalled.
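For illustration only, below is a minimal sketch of the per-bucket locking pattern (the approach the Linux kernel uses for its futex hash table) that could replace a single lock around one shared tree. All names and types here are hypothetical placeholders, not Gramine's actual futex code:

    #include <stdint.h>

    #define FUTEX_BUCKETS 256

    /* Placeholder types standing in for Gramine's internal spinlock and AVL tree. */
    typedef struct { volatile int locked; } spinlock_t;
    struct avl_tree { void* root; };

    struct futex_bucket {
        spinlock_t lock;        /* protects only this bucket's futexes */
        struct avl_tree tree;   /* futexes whose user address hashes into this bucket */
    };

    static struct futex_bucket g_futex_buckets[FUTEX_BUCKETS];

    /* Waiters and wakers on different futex addresses now contend only when their
     * addresses hash to the same bucket, instead of all serializing on one global lock. */
    static inline struct futex_bucket* futex_bucket_for(uint64_t uaddr) {
        return &g_futex_buckets[(uaddr * 0x9E3779B97F4A7C15ULL) >> 56];
    }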

boryspoplawski avatar Sep 14 '22 11:09 boryspoplawski

I ran into a similar issue today.

With the latest version of Gramine on CentOS 8.5, I ran MySQL with gramine-sgx (bound to one NUMA node with numactl) and used sysbench oltp_read_write as the workload. I found that gramine-sgx mysqld achieves only about 30-40% of bare-metal performance.

Here is the flame graph; you can see that the hot spots are the mutexes taken in malloc and free.


lianghouxu avatar Nov 03 '22 07:11 lianghouxu

Thanks @lianghouxu for the flame graph!

We are aware that libos_syscall_ppoll() (Gramine's emulation of the poll family of syscalls) is sub-par and needs a rewrite. But it is also clear that malloc() and free() in Gramine take a global slab-allocator lock, and this is where the overhead stems from.
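The pattern in question is essentially the following; this is a minimal illustration with hypothetical names and a pthread mutex standing in for Gramine's internal lock, not the actual implementation:

    #include <stddef.h>
    #include <pthread.h>

    /* Stand-ins for slab routines that are not thread-safe by themselves. */
    void* slab_alloc(size_t size);
    void  slab_free(void* ptr);

    static pthread_mutex_t g_slab_lock = PTHREAD_MUTEX_INITIALIZER;

    void* locked_malloc(size_t size) {
        pthread_mutex_lock(&g_slab_lock);   /* every allocating thread serializes here */
        void* ptr = slab_alloc(size);
        pthread_mutex_unlock(&g_slab_lock);
        return ptr;
    }

    void locked_free(void* ptr) {
        pthread_mutex_lock(&g_slab_lock);   /* ... and again on every free */
        slab_free(ptr);
        pthread_mutex_unlock(&g_slab_lock);
    }

With many threads doing frequent short-lived allocations (as in poll emulation), all of them queue on that one lock, which is what shows up as the malloc/free hotspots in the flame graph.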

dimakuv avatar Nov 03 '22 09:11 dimakuv

I met the same problem in MySQL. I wonder if we could use a C99 variable-length array to allocate these short-lived buffers on the stack instead. E.g. https://github.com/gramineproject/gramine/blob/72f65247af61c55c54033c3c6934ed21b607987b/libos/src/sys/libos_poll.c#L51 could be rewritten as:

    long pals_buf[DIV_ROUND_UP(nfds * sizeof(PAL_HANDLE), sizeof(long))];
    PAL_HANDLE* pals = (void *)&pals_buf[0];

lejunzhu avatar Nov 09 '22 13:11 lejunzhu

@lejunzhu nfds can be arbitrarily large.

What we could do is have logic like this:

PAL_HANDLE* pals;
struct fds_mapping_t* fds_mapping;
pal_wait_flags_t* pal_events;
bool allocated_on_stack;

if (nfds < SOME_MAX_LIMIT) {
    // new logic: small requests go on the stack
    pals        = alloca(nfds * sizeof(PAL_HANDLE));
    fds_mapping = alloca(nfds * sizeof(struct fds_mapping_t));
    pal_events  = alloca(nfds * sizeof(*pal_events) * 2);
    allocated_on_stack = true;
} else {
    // old logic: large requests fall back to the heap
    pals        = malloc(nfds * sizeof(PAL_HANDLE));
    fds_mapping = malloc(nfds * sizeof(struct fds_mapping_t));
    pal_events  = malloc(nfds * sizeof(*pal_events) * 2);
    allocated_on_stack = false;
}

... do stuff ...

if (!allocated_on_stack) {
    free(pals);
    free(fds_mapping);
    free(pal_events);
}

The SOME_MAX_LIMIT macro should be chosen such that we don't consume more than e.g. 64KB-128KB of stack (I wonder how many nfds it corresponds to; hopefully more than 10).
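To make the sizing concrete, one hypothetical way to derive such a limit from a stack budget is sketched below. The per-entry sizes in the comment are assumptions (they are confirmed further down the thread), and neither macro name exists in Gramine:

    /* Assuming sizeof(PAL_HANDLE) == 8, sizeof(struct fds_mapping_t) == 16 and
     * sizeof(pal_wait_flags_t) == 4 (with two pal_wait_flags_t entries per fd),
     * each fd costs 8 + 16 + 2*4 = 32 bytes of stack, so a 64KB budget allows
     * roughly 2048 fds. */
    #define POLL_STACK_BUDGET (64 * 1024)
    #define SOME_MAX_LIMIT                                                         \
        (POLL_STACK_BUDGET / (sizeof(PAL_HANDLE) + sizeof(struct fds_mapping_t) +  \
                              2 * sizeof(pal_wait_flags_t)))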

dimakuv avatar Nov 09 '22 15:11 dimakuv

But also, if @lejunzhu wants to implement this, could you re-structure that whole _libos_syscall_poll() function while you're at it? It looks pretty ugly and complicated, and it can surely be refactored without changing the logic.

This will be at least one step in the right direction for Gramine's poll emulation.

dimakuv avatar Nov 09 '22 15:11 dimakuv

Could we give this to Borys instead? He did rewrites of other similar functions, so I think it makes more sense if he implemented this change.

mkow avatar Nov 09 '22 16:11 mkow

@mkow I'd be happy with that, but isn't Borys already tied up with other tasks?

dimakuv avatar Nov 09 '22 17:11 dimakuv

@lejunzhu nfds can be arbitrarily large.

You are right, I forgot this.

The SOME_MAX_LIMIT macro should be chosen such that we don't consume more than e.g. 64KB-128KB of stack (I wonder how many nfds it corresponds to; hopefully more than 10).

These structures are very small, so I would say the number is way larger than 10: sizeof(PAL_HANDLE) is 8, sizeof(struct fds_mapping_t) is 16, sizeof(pal_wait_flags_t) is 4, and sizeof(struct pollfd) is 8.

So, I think even using 1K stack space for each malloc will help.

This issue is not very urgent. If a poll() rewrite is already underway, I can wait for that. Otherwise, I can try to patch only the malloc() part along the way (libos_syscall_select -> _libos_syscall_poll -> _PalStreamsWaitEvents). But I'm afraid I don't know this part well enough for any substantial change.

lejunzhu avatar Nov 10 '22 01:11 lejunzhu

So, I think even using 1K stack space for each malloc will help.

Yes, this will be good! From my experience, non-network-bound applications typically use about 5-20 FDs in poll, no more than that.

This issue is not very urgent. If a poll() rewrite is already underway, I can wait for that.

No, the poll() rewrite was not yet started. In fact, this task has a very low priority currently. I don't know if @boryspoplawski has anything else to say here.

dimakuv avatar Nov 10 '22 09:11 dimakuv

1K sounds way too much for a stack buffer.

boryspoplawski avatar Nov 10 '22 12:11 boryspoplawski

1K sounds way too much for a stack buffer.

Do you mean that a lower threshold, e.g. 320 bytes for each buffer, would be acceptable, or should this problem rather be addressed with a poll rewrite?

For 20 FDs, we'll need four 160-byte buffers and one 320-byte buffer on the stack.

lejunzhu avatar Nov 11 '22 00:11 lejunzhu

I've created PR #1051 to reduce the lock contention in poll. @dimakuv and @boryspoplawski, would you please take a look at it?

@lianghouxu would you please test this PR as well, to see if it improves MySQL performance in your environment?

lejunzhu avatar Nov 17 '22 09:11 lejunzhu

@lejunzhu I tested PR #1051 and the performance looks good. Before applying the patch, with 120 connections, MySQL in gramine-sgx achieves about 30% of bare-metal performance. After applying the patch, with 120 connections, it achieves about 50% of bare metal.

Here is the flame graph; the malloc and free hotspots are gone.

lianghouxu avatar Nov 17 '22 14:11 lianghouxu

From the flame graph, the new bottleneck seems to be get_fd_handle(). It is also reported in #853.

lejunzhu avatar Nov 18 '22 00:11 lejunzhu

I'm closing this issue now, as it has become a duplicate of #853.

dimakuv avatar Mar 09 '23 14:03 dimakuv