exqlite icon indicating copy to clipboard operation
exqlite copied to clipboard

Segfault when exceeding DBConnection timeout

Open LostKobrakai opened this issue 3 years ago • 12 comments

I was debugging a sudden reboot on our nerves system with a expensive query. I had the memory usage and OOM in mind, but today I found out that raising the timeout actually makes the query succeed (no reboot). Making the timout be exceeded on my local machine I could reproduce the issue and even get the seg fault message (see below). I don't think the query matters for as long as the timeout is exceeded.

iex(6)> Task.start(fn -> Repo.checkout(fn -> Ecto.Adapters.SQL.query!(Repo, query) end, timeout: :timer.seconds(1)) end)
{:ok, #PID<0.2359.0>}
iex(7)> 11:24:58.427 [error] Exqlite.Connection (#PID<0.2164.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2359.0> timed out because it queued and checked out the connection for longer than 1000ms

#PID<0.2359.0> was at location:

    (exqlite 0.8.6) lib/exqlite/sqlite3.ex:91: Exqlite.Sqlite3.multi_step/3
    (exqlite 0.8.6) lib/exqlite/sqlite3.ex:138: Exqlite.Sqlite3.fetch_all/4
    (exqlite 0.8.6) lib/exqlite/connection.ex:529: Exqlite.Connection.get_rows/2
    (exqlite 0.8.6) lib/exqlite/connection.ex:475: Exqlite.Connection.execute/4
    (db_connection 2.4.1) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.1) lib/db_connection.ex:1333: DBConnection.run_execute/5
    (db_connection 2.4.1) lib/db_connection.ex:650: DBConnection.execute/4

11:25:00.283 [debug] …
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
[1]    37300 segmentation fault  iex -S mix

LostKobrakai avatar Jan 28 '22 10:01 LostKobrakai

I just saw this one, which is probably related: https://github.com/elixir-sqlite/exqlite/issues/127

LostKobrakai avatar Jan 31 '22 13:01 LostKobrakai

I am still seeing the segfault described above in my app.

19:21:45.978 [error] Exqlite.Connection (#PID<0.2619.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2866.0> timed out because it queued and checked out the connection for longer than 15000m
signal: segmentation fault

Is there a way to increase this timeout? I don't think any of the timeout settings are affecting it for me. Any way I can help to debug it?

SebastianSzturo avatar Jun 08 '22 19:06 SebastianSzturo

Which version are you running?

Do you have a reliable way to reproduce the segfault?

warmwaffles avatar Jun 08 '22 20:06 warmwaffles

I am running on master. I'll try to reproduce it in a test file to share and report back 👍🏻

SebastianSzturo avatar Jun 09 '22 02:06 SebastianSzturo

Experiencing the same issue. In our case, we've set DBConnection :timeout and Exqlite :busy_timeout to the same value. Could a potential cause be that the DBConnection execute times out and terminates Exqlite before it has an opportunity to free up memory? Then on the subsequent connection creation, Exqlite attempts to allocate memory on the wrong block of data?

keatz55 avatar Sep 22 '22 19:09 keatz55

Could a potential cause be that the DBConnection execute times out and terminates Exqlite before it has an opportunity to free up memory?

So as it stands right now with the DirtyNIF implementation, the DBConnection will crash but the query running in the background never actually gets terminated. If you are operating on a large database it is quite possible that your query would continue to run even if the connection timed out.

I have been meaning to find time to swap out the DirtyNIF model for a runtime that can cancel running queries.

Then on the subsequent connection creation, Exqlite attempts to allocate memory on the wrong block of data?

We are using enif_alloc/1 and enif_free/1 under the hood for all sqlite memory allocations. This is so that when you look at the live dashboard or query the VM you can see accurate memory usage by the erlang process. Otherwise the bytes that sqlite uses would not be tracked.

https://github.com/elixir-sqlite/exqlite/blob/2c9fa05b64cab9a5470adeae2e317f1480fbd0ca/c_src/sqlite3_nif.c#L32-L107

It is entirely possible that I encoded a bug there, but I don't think that is the case. The first sizeof(int) of the bytes allocated are reserved for the block size of memory.

warmwaffles avatar Sep 22 '22 22:09 warmwaffles

What would be really helpful is if there is a way I can replicate this error repeatedly.

warmwaffles avatar Sep 22 '22 22:09 warmwaffles

Hey @warmwaffles ,

I have been working with @keatz55 on recreating this issue. here is the test case that you can use to reproduce in your machine. Also what we noticed in our testing is that when we adjust timeout value to double the value of busy_timeout. we are not getting any segfaults.

test "Reproduce seg fault" do
    path = "./test.db"

    {:ok, conn} =
      DBConnection.start_link(Connection,
        busy_timeout: 50000,
        pool_size: 50,
        timeout: 1,
        database: "./test.db",
        journal_mode: :wal
      )

    query = %Query{statement: "create table foo(id integer, val integer)"}
    {:ok, _, _} = DBConnection.execute(conn, query, [])

    values = for i <- 1..1000, do: "(#{i}, #{i})"

    insert_query = %Query{
      statement: "insert into foo(id, val) values #{Enum.join(values, ",")}"
    }

    Task.async_stream(1..5000, fn _val ->
      try do
        DBConnection.execute(conn, insert_query, [],
          busy_timeout: 50000,
          timeout: 1
        )
      catch
        kind, reason ->
          IO.inspect("#{inspect(kind)} #{inspect(reason)}", label: "Error")
      end
    end)
    |> Stream.run()
  end

Output:

00:46:45.859 [error] Exqlite.Connection (#PID<0.277.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.519.0> timed out because it queued and checked out the connection for longer than 1ms

#PID<0.519.0> was at location:

    (exqlite 0.11.4) lib/exqlite/connection.ex:500: Exqlite.Connection.prepare/3
    (exqlite 0.11.4) lib/exqlite/connection.ex:194: Exqlite.Connection.handle_execute/4
    (db_connection 2.4.2) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.2) lib/db_connection.ex:1364: DBConnection.run_execute/5
    (db_connection 2.4.2) lib/db_connection.ex:1459: DBConnection.run/6
    (db_connection 2.4.2) lib/db_connection.ex:652: DBConnection.execute/4
    test/exqlite/integration_test.exs:225: anonymous fn/3 in Exqlite.IntegrationTest."test Reproduce seg fault"/1
    (elixir 1.13.4) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (elixir 1.13.4) lib/task/supervised.ex:34: Task.Supervised.reply/4
    (stdlib 4.0.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

zsh: segmentation fault  mix test test/exqlite/integration_test.exs:202

pavancse17 avatar Sep 26 '22 19:09 pavancse17

Excellent news! I'll take a look at this soon.

warmwaffles avatar Sep 26 '22 19:09 warmwaffles

The busy_timeout here

        DBConnection.execute(conn, insert_query, [],
          busy_timeout: 50000,
          timeout: 1
        )

Has no effect on the query itself. The busy_timeout is used for only connecting to the database which runs PRAGMA busy_timeout = 50000

But that timeout is definitely something happening with the DBConnection pooling.

warmwaffles avatar Sep 27 '22 02:09 warmwaffles

@pavancse17 and @keatz55 right now if you keep your pool size at 1, this segfault won't occur.

I'm still in the process of trying to figure out where the address boundary error is happening precisely, but I am suspicious that it is around the open and closing of connections.

warmwaffles avatar Sep 27 '22 04:09 warmwaffles

I'm still looking in to this issue. Will need to probably utilize a custom compiled ERTS with valgrind compiled with it.

warmwaffles avatar Oct 03 '22 05:10 warmwaffles

I'm still having a difficult time trying to figure what exactly is causing this.

Here is the test that reproduces the issue.

https://github.com/elixir-sqlite/exqlite/compare/segfault-hunting

warmwaffles avatar Oct 24 '22 02:10 warmwaffles

I believe something funky is happening when a DBConnection times out. There is no callback to handle that defined upstream to allow for a graceful tear down of resources.

I don't have any concrete evidence yet, but I believe a reference to an sqlite database in memory is getting garbage collected when it shouldn't be, or a statement is getting garbage collected at the wrong time.

I do believe I am going to take some time and implement https://www.erlang.org/doc/man/erl_nif.html#enif_open_resource_type_x which has a ErlNifResourceDown callback that may be useful for this project to gracefully stop any running queries and do some better clean up.

warmwaffles avatar Oct 24 '22 03:10 warmwaffles