otp icon indicating copy to clipboard operation
otp copied to clipboard

[RFC] Introduce kernel TLS offload in inet_tls_dist

Open zzydxm opened this issue 2 years ago • 24 comments

This could introduce significant performance boost on dist traffic. More discussion is happening in https://github.com/erlang/otp/pull/5840

zzydxm avatar Jun 21 '22 23:06 zzydxm

CT Test Results

       3 files     118 suites   1h 40m 9s :stopwatch: 1 978 tests 1 780 :heavy_check_mark: 198 :zzz: 0 :x: 4 915 runs  3 984 :heavy_check_mark: 931 :zzz: 0 :x:

Results for commit b9af2bd7.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

github-actions[bot] avatar Jun 21 '22 23:06 github-actions[bot]

At first glance, I think this looks good :) , I have started by enabling testing and will try to make some time for a deeper review during the week. I apologize if progress might be a little slow as it is still vacation time and we want to have everyone's input before merging.

IngelaAndin avatar Aug 01 '22 09:08 IngelaAndin

Turns out we did not have a test machine, running or daily builds, that is able to run ktls with TLS-1.3 out of the box. We are working on a solution.

IngelaAndin avatar Aug 05 '22 06:08 IngelaAndin

We now have a FreeBSD machine that has ktls capabilities. Although the test is still skipped as inet:getopts fails to report the set options. I guess this works in your environment?

@RaimoNiskanen what do you make of this?

Here are some of the commands from init_per_test_case run manually.

[...]
9> inet:setopts(Server, [{raw, 6, 31, <<"tls">>}]).
ok
10> inet:setopts(Server, [{raw, 282, 1, ServerTx}]).
ok
11> inet:setopts(Server, [{raw, 282, 2, ServerRx}]).
ok
12>  inet:setopts(Client, [{raw, 6, 31, <<"tls">>}]).                           
ok                
13> inet:setopts(Client, [{raw, 282, 1, ClientTx}]).
ok
14>  inet:setopts(Client, [{raw, 282, 2, ClientRx}]).
ok
15> inet:getopts(Server, [{raw, 6, 31, 3}]).
{ok,[]}
16> inet:getopts(Server, [{raw, 282, 1, 56}]).
{ok,[]}
17> inet:getopts(Client, [{raw, 6, 31, 3}]).
{ok,[]}
18> inet:getopts(Client, [{raw, 282, 1, 56}]).
{ok,[]}
19> ok = gen_tcp:send(Client, "client").
ok
20> {ok, "client"} = gen_tcp:recv(Server, 6, 1000).
{ok,"client"}
21> ok = gen_tcp:send(Server, "server").
ok
22> {ok, "server"} = gen_tcp:recv(Client, 6, 1000).
{ok,"server"}
23> 

IngelaAndin avatar Aug 09 '22 15:08 IngelaAndin

Right, it works for me on linux 5.12.0. Maybe we can remove the ServerTx/ClientTx check?

2> inet:getopts(Server, [{raw, 6, 31, 3}]).
{ok,[{raw,6,31,<<"tls">>}]}
3> inet:getopts(Server, [{raw, 282, 1, 56}]).
{ok,[{raw,282,1,<<4,3,52,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,...>>}]}
4> inet:getopts(Server, [{raw, 282, 2, 56}]).
{ok,[{raw,282,2,<<4,3,52,0,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,...>>}]}
5> inet:getopts(Client, [{raw, 6, 31, 3}]).
{ok,[{raw,6,31,<<"tls">>}]}
6> inet:getopts(Client, [{raw, 282, 1, 56}]).
{ok,[{raw,282,1,<<4,3,52,0,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,...>>}]}
7> inet:getopts(Client, [{raw, 282, 2, 56}]).
{ok,[{raw,282,2,<<4,3,52,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,...>>}]}

Or maybe TLS_CIPHER_AES_GCM_256 is not supported on the machine?

zzydxm avatar Aug 09 '22 18:08 zzydxm

By the way we have tested socket NIF version ktls dist, it adds 5% more global CPU usage to our prod node compare to the inet_drv version

zzydxm avatar Aug 09 '22 19:08 zzydxm

By the way we have tested socket NIF version ktls dist, it adds 5% more global CPU usage to our prod node compare to the inet_drv version

Is that tested by using inet_backend=socket in inet_tls_dist, or by prying out the socket socket and writing a new inet_tls_dist with direct socket operations?

RaimoNiskanen avatar Aug 11 '22 07:08 RaimoNiskanen

By the way we have tested socket NIF version ktls dist, it adds 5% more global CPU usage to our prod node compare to the inet_drv version

Is that tested by using inet_backend=socket in inet_tls_dist, or by prying out the socket socket and writing a new inet_tls_dist with direct socket operations?

I wrote a new dist module with direct socket operation. But it starts with inet_backend=socket and then use "downgrade"-like operation to shutdown the gen_tcp_socket state machine, partial code:

common_hs_data({'$inet', gen_tcp_socket, {_, SocketRef}}) ->
    Sender = spawn_link(fun() -> sender() end),
    Receiver = spawn_link(fun() -> receiver() end),
    #hs_data{
        this_flags = 0,
        socket = SocketRef,
        f_send = fun(Socket, Msg) ->
            Data = iolist_to_binary(Msg),
            Len = size(Data),
            socket:send(Socket, <<Len:1/unsigned-big-integer-unit:32, Data/binary>>)
        end,
        f_recv = fun(Socket, _Size, _Timeout) ->
            case socket:recv(Socket, 4) of
                {ok, <<Len:1/unsigned-big-integer-unit:32>>} ->
                    case socket:recv(Socket, Len) of
                        {ok, Binary} ->
                            {ok, binary_to_list(Binary)};
                        Error ->
                            Error
                    end;
                Error ->
                    Error
            end
        end,
        f_setopts_pre_nodeup = fun(_) -> ok end,
        f_setopts_post_nodeup = fun(_) -> ok end,
        f_getll = fun(_) -> {ok, Sender} end,
        f_address = fun(Socket, Node) ->
            {ok, #{addr := Address, family := Family, port := Port}} = socket:peername(Socket),
            [_, Host] = binary:split(erlang:atom_to_binary(Node), <<"@">>),
            #net_address{
                address = {Address, Port},
                host = erlang:binary_to_list(Host),
                protocol = socket_ktls,
                family = Family
            }
        end,
        f_handshake_complete = fun(Socket, _Node, DHandle) ->
            Sender ! {handshake_complete, Socket, DHandle, self(), Receiver}
        end,
        mf_tick = fun(_) -> Sender ! tick end,
        mf_getstat = fun(Socket) ->
            #{counters := #{read_pkg := Read, write_pkg := Write}} = socket:info(Socket),
            {ok, Read, Write, 0}
        end
    }.

sender() ->
    receive
        {handshake_complete, Socket, DHandle, DistUtil, Receiver} ->
            erlang:dist_ctrl_input_handler(DHandle, Receiver),
            erlang:dist_ctrl_set_opt(DHandle, get_size, true),
            erlang:dist_ctrl_get_data_notification(DHandle),
            Receiver ! {start_receive, Socket, DHandle, DistUtil},
            sender_loop(Socket, DHandle, DistUtil)
    end.

sender_loop(Socket, DHandle, DistUtil) ->
    receive
        dist_data ->
            case socket:send(Socket, dist_data(DHandle, [], 0)) of
                {error, _} ->
                    DistUtil ! {tcp_closed, Socket},
                    socket:close(Socket);
                _ ->
                    sender_loop(Socket, DHandle, DistUtil)
            end;
        tick ->
            case socket:send(Socket, <<0:1/unsigned-big-integer-unit:32>>) of
                {error, _} ->
                    DistUtil ! {tcp_closed, Socket},
                    socket:close(Socket);
                _ ->
                    sender_loop(Socket, DHandle, DistUtil)
            end;
        {'$gen_call', From, dist_get_tls_socket} ->
            gen:reply(From, {ok, Socket}),
            sender_loop(Socket, DHandle, DistUtil);
        _ ->
            sender_loop(Socket, DHandle, DistUtil)
    end.

dist_data(DHandle, List, Size) ->
    case erlang:dist_ctrl_get_data(DHandle) of
        {Len, Data} when Size < 65536 ->
            dist_data(DHandle, [List, <<Len:1/unsigned-big-integer-unit:32>>, Data], Size + Len + 4);
        {Len, Data} ->
            erlang:dist_ctrl_get_data_notification(DHandle),
            [List, <<Len:1/unsigned-big-integer-unit:32>>, Data];
        none ->
            erlang:dist_ctrl_get_data_notification(DHandle),
            List
    end.

receiver() ->
    receive
        {start_receive, Socket, DHandle, DistUtil} ->
            receiver_loop(Socket, DHandle, DistUtil, [], 0, 0)
    end.

receiver_loop(Socket, DHandle, DistUtil, Buffer, Received, 0) when Received >= 4 ->
    <<Size:1/unsigned-big-integer-unit:32, Rest/binary>> = iolist_to_binary(Buffer),
    receiver_loop(Socket, DHandle, DistUtil, [Rest], Received - 4, Size);
receiver_loop(Socket, DHandle, DistUtil, Buffer, Received, Size) when Size > 0, Received >= Size ->
    <<Data:Size/binary, Rest/binary>> = iolist_to_binary(Buffer),
    erlang:dist_ctrl_put_data(DHandle, Data),
    receiver_loop(Socket, DHandle, DistUtil, [Rest], Received - Size, 0);
receiver_loop(Socket, DHandle, DistUtil, Buffer, Received, Size) ->
    case socket:recv(Socket) of
        {ok, Data} ->
            receiver_loop(Socket, DHandle, DistUtil, [Buffer, Data], Received + size(Data), Size);
        {error, _} ->
            DistUtil ! {tcp_closed, Socket},
            socket:close(Socket)
    end.

set_ktls(Socket, Receiver, Type, ControlPid) ->
    State = sys:replace_state(
        Receiver,
        fun({_, State0}) ->
            gen_tcp:controlling_process(Socket, ControlPid),
            {downgrade, State0}
        end
    ),
    inet:setopts(Socket, [list, {active, false}]),
    gen_statem:stop(Receiver, {shutdown, downgrade}, net_kernel:connecttime()),
    % {_, #state{connection_states = ConnectionStates}} = State,
    ConnectionStates = element(12, element(2, State)),
    CurrentWrite = maps:get(current_write, ConnectionStates),
    CurrentRead = maps:get(current_read, ConnectionStates),
    % #cipher_state{iv = <<WriteSalt:4/bytes, WriteIV:8/bytes>>, key = WriteKey} = maps:get(cipher_state, CurrentWrite),
    % #cipher_state{iv = <<ReadSalt:4/bytes, ReadIV:8/bytes>>, key = ReadKey} = maps:get(cipher_state, CurrentRead),
    {cipher_state, <<WriteSalt:4/bytes, WriteIV:8/bytes>>, WriteKey, _, _, _, _} = maps:get(cipher_state, CurrentWrite),
    {cipher_state, <<ReadSalt:4/bytes, ReadIV:8/bytes>>, ReadKey, _, _, _, _} = maps:get(cipher_state, CurrentRead),
    WriteSeq = maps:get(sequence_number, CurrentWrite),
    ReadSeq = maps:get(sequence_number, CurrentRead),
    % SOL_TCP = 6, TCP_ULP = 31
    inet:setopts(Socket, [{raw, 6, 31, <<"tls">>}]),
    % SOL_TLS = 282, TLS_TX = 1, TLS_RX = 2, TLS_1_3_VERSION = <<4, 3>>, TLS_CIPHER_AES_GCM_256 = <<52, 0>>
    inet:setopts(Socket, [
        {raw, 282, 1, <<4, 3, 52, 0, WriteIV/binary, WriteKey/binary, WriteSalt/binary, WriteSeq:64>>}
    ]),
    inet:setopts(Socket, [{raw, 282, 2, <<4, 3, 52, 0, ReadIV/binary, ReadKey/binary, ReadSalt/binary, ReadSeq:64>>}]),
    {'$inet', gen_tcp_socket, {SocketPid, SocketRef}} = Socket,
    sys:replace_state(
        SocketPid,
        fun({_, State0}) ->
            socket:setopt(SocketRef, {otp, controlling_process}, ControlPid),
            {closed, State0}
        end
    ),
    gen_statem:stop(SocketPid, shutdown, net_kernel:connecttime()),
    socket:setopt(SocketRef, {otp, rcvbuf}, 66000),
    % to avoid client send first application data together with the handshake message
    Type =:= client andalso timer:sleep(1000).

zzydxm avatar Aug 11 '22 07:08 zzydxm

We have managed to get it to work.

Simple and maybe stupid beginner's problem - it may be so obvious to the Linux people that operate in this field that it is not written up front that kTLS is implemented with a kernel module tls that is not loaded by default, so you need to execute modprobe tls as root. Eventually we found that in an Nginx HowTo...

RaimoNiskanen avatar Aug 11 '22 08:08 RaimoNiskanen

Maybe we can remove the ServerTx/ClientTx check?

I think that the check is important to verify that the cipher suite we set is accepted by the kernel. Otherwise it may silently fail and we may accidentally transmit cleartext data.

RaimoNiskanen avatar Aug 11 '22 08:08 RaimoNiskanen

The only improvement of your socket based inet_tls_dist module I can see immediately is that receiver_loop/6 can avoid flattening all data into a binary when parsing the length. Now, if more data is needed to fill the packet, a second flattening will have to be done, and often it would be possible to get away with only one flattening.

RaimoNiskanen avatar Aug 11 '22 08:08 RaimoNiskanen

When I run the ssl_dist_SUITE:ktls_basic test in CommonTest I get this Notice in the console:

=NOTICE REPORT==== 11-Aug-2022::10:42:00.556270 ===
TLS server: In state start at tls_record.erl:564 generated SERVER ALERT: Fatal - Unexpected Message
 - {unsupported_record_type,0}

I get the same in the CommonTest log, intermixed with Info:s from net_kernel:

=INFO REPORT==== 11-Aug-2022::10:45:10.913640 ===
{net_kernel,{auto_connect,ssl_dist_SUITE_ktls_basic_34@elxa89q3m53,
                          {15848102,#Ref<0.3270998642.3550085128.2473>}}}
=NOTICE REPORT==== 11-Aug-2022::10:45:11.085984 ===
TLS server: In state start at tls_record.erl:564 generated SERVER ALERT: Fatal - Unexpected Message
 - {unsupported_record_type,0}
=INFO REPORT==== 11-Aug-2022::10:45:11.093911 ===
{net_kernel,{auto_connect,test_server@elxa89q3m53,
                          {10720637,#Ref<0.3270998642.3550085128.2464>}}}
=INFO REPORT==== 11-Aug-2022::10:45:18.095897 ===
{net_kernel,{'EXIT',<0.118.0>,
                    {ssl_connect_failed,{127,0,1,1},44165,{error,timeout}}}}
=INFO REPORT==== 11-Aug-2022::10:45:18.096078 ===
{net_kernel,{net_kernel,1423,nodedown,test_server@elxa89q3m53}}
=INFO REPORT==== 11-Aug-2022::10:45:18.096294 ===
{net_kernel,{disconnect,test_server@elxa89q3m53}}

Is that Ok, a problem, or possible to avoid?

RaimoNiskanen avatar Aug 11 '22 08:08 RaimoNiskanen

By the way we have tested socket NIF version ktls dist, it adds 5% more global CPU usage to our prod node compare to the inet_drv version

What CPU usage have you measured with inet_drv, vs. with the socket NIF?

RaimoNiskanen avatar Aug 11 '22 12:08 RaimoNiskanen

By the way we have tested socket NIF version ktls dist, it adds 5% more global CPU usage to our prod node compare to the inet_drv version

Is that tested by using inet_backend=socket in inet_tls_dist, or by prying out the socket socket and writing a new inet_tls_dist with direct socket operations?

Meant to ask, what were the values before and after?

IngelaAndin avatar Aug 11 '22 12:08 IngelaAndin

Thanks for the comments! I will work on the changes.

On our most major service node (which handles client connection, and message/request routing to 25k other hosts), CPU utilization is like inet_drv_ktls_dist: 80%, socket_ktls_dist: 85%, inet_tls_dist: 95%

zzydxm avatar Aug 11 '22 16:08 zzydxm

The only improvement of your socket based inet_tls_dist module I can see immediately is that receiver_loop/6 can avoid flattening all data into a binary when parsing the length. Now, if more data is needed to fill the packet, a second flattening will have to be done, and often it would be possible to get away with only one flattening.

I tried many different approaches, but this turns out to be the most efficient one, although it does not look fast. Somehow more pattern matching clauses / <<Size:32, Data:Size/binary, Rest/binary>> match / <<Buffer/binary, Data/binary>> binary concat are all relatively slow.

I do have other options to test though, will keep tuning it.

zzydxm avatar Aug 11 '22 16:08 zzydxm

Regarding socket performance, there is also Erlang Forum thread: https://erlangforums.com/t/high-er-cpu-utilization-with-socket-backend-esp-for-udp/1720

I am also observing some mixed performance data from socket (used directly, not via gen_tcp backend). There might be some opportunities for optimisation. One example is closing listening socket. Somehow, if I accepted >100k sockets from a single listener and then closed the listener, it causes severe (up to a minute) lock contention on resource_monitors and esock.r[NNNNN] (I haven't had time to write the exact test case yet, hope to work on it later).

max-au avatar Aug 11 '22 17:08 max-au

@zzydxm: I am working on an update with my suggested changes above - it became many alterations to set_ktls. Will try to push a commit tomorrow (CEST)

RaimoNiskanen avatar Aug 11 '22 19:08 RaimoNiskanen

@zzydxm: I am working on an update with my suggested changes above - it became many alterations to set_ktls. Will try to push a commit tomorrow (CEST)

Oh really thanks! I will wait for your change first then.

zzydxm avatar Aug 11 '22 21:08 zzydxm

I just pushed my suggestion. Comments are appreciated.

RaimoNiskanen avatar Aug 12 '22 08:08 RaimoNiskanen

These changes looks good to me, thanks a lot for the work!

zzydxm avatar Aug 12 '22 21:08 zzydxm

=NOTICE REPORT==== 11-Aug-2022::10:42:00.556270 === TLS server: In state start at tls_record.erl:564 generated SERVER ALERT: Fatal - Unexpected Message {unsupported_record_type,0}

@RaimoNiskanen Do you still see this? I can't reproduce this issue. There could be several reasons of {unsupported_record_type,0}:

  1. unexpected connection to the listen port which is not ktls_dist protocol
  2. {active, 1} is not done correctly so some data in dist_util handshake was still sent to the ssl receiver (This is unlikely, and will cause failure on dist_util handshake)

zzydxm avatar Aug 12 '22 23:08 zzydxm

There could be several reasons of {unsupported_record_type,0}:

This seems to be nothing to worry about for kTLS. The ssl_dist_SUITE:basic/1 test, that you duplicated but with kTLS instead, shows the same printout. It most probably comes from the test testing that the test_server node, which is not configured for TLS distribution, fails to connect to the test nodes that are. So the test_server node tries to connect using regular TCP and therefore sends something that is misinterpreted by the test node's TLS protocol.

So, it is an expected warning printout.

RaimoNiskanen avatar Aug 15 '22 06:08 RaimoNiskanen

Now we need to figure out why the test ktls_basic/1 fails on some of our test machines...

We have one with kernel 5.4.0 where setting the cipher fails, and sure enough, in /usr/include/linux/tls.h there is only TLS 1.2 and AES_GCM_128. This is an Ubuntu 18.04.6 LTS. It seems we need to bring back the set/get cipher test to init_per_testcase/2.

RaimoNiskanen avatar Aug 15 '22 14:08 RaimoNiskanen

The other machine that failed (PowerPC 64 little-endian, Linux-5.4.0, Ubuntu 20.04.0 LTS) with a timeout - pang after 7 s has now had two successful runs after that

RaimoNiskanen avatar Aug 17 '22 09:08 RaimoNiskanen

I will go on vacation now and therefore not touch this for a little over a week...

RaimoNiskanen avatar Aug 19 '22 13:08 RaimoNiskanen

We are thinking about merging; have you (@zzydxm) got more tuning to do before that?

RaimoNiskanen avatar Sep 01 '22 12:09 RaimoNiskanen

We are thinking about merging; have you (@zzydxm) got more tuning to do before that?

No I didn't do any more tuning on this. Thanks a lot for the work!

zzydxm avatar Sep 06 '22 16:09 zzydxm

Forgot to check test results after returning from my vacation. We have consistent failures on PowerPC (Ubuntu 20.04.4 LTS) that I would like to investigate before merging...

RaimoNiskanen avatar Sep 08 '22 13:09 RaimoNiskanen

Is there any updates on this? Thanks!

zzydxm avatar Sep 26 '22 22:09 zzydxm