exqlite
exqlite copied to clipboard
Segfault when exceeding DBConnection timeout
I was debugging a sudden reboot on our nerves system with a expensive query. I had the memory usage and OOM in mind, but today I found out that raising the timeout actually makes the query succeed (no reboot). Making the timout be exceeded on my local machine I could reproduce the issue and even get the seg fault message (see below). I don't think the query matters for as long as the timeout is exceeded.
iex(6)> Task.start(fn -> Repo.checkout(fn -> Ecto.Adapters.SQL.query!(Repo, query) end, timeout: :timer.seconds(1)) end)
{:ok, #PID<0.2359.0>}
iex(7)> 11:24:58.427 [error] Exqlite.Connection (#PID<0.2164.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2359.0> timed out because it queued and checked out the connection for longer than 1000ms
#PID<0.2359.0> was at location:
(exqlite 0.8.6) lib/exqlite/sqlite3.ex:91: Exqlite.Sqlite3.multi_step/3
(exqlite 0.8.6) lib/exqlite/sqlite3.ex:138: Exqlite.Sqlite3.fetch_all/4
(exqlite 0.8.6) lib/exqlite/connection.ex:529: Exqlite.Connection.get_rows/2
(exqlite 0.8.6) lib/exqlite/connection.ex:475: Exqlite.Connection.execute/4
(db_connection 2.4.1) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.4.1) lib/db_connection.ex:1333: DBConnection.run_execute/5
(db_connection 2.4.1) lib/db_connection.ex:650: DBConnection.execute/4
11:25:00.283 [debug] …
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
[1] 37300 segmentation fault iex -S mix
I just saw this one, which is probably related: https://github.com/elixir-sqlite/exqlite/issues/127
I am still seeing the segfault described above in my app.
19:21:45.978 [error] Exqlite.Connection (#PID<0.2619.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2866.0> timed out because it queued and checked out the connection for longer than 15000m
signal: segmentation fault
Is there a way to increase this timeout? I don't think any of the timeout settings are affecting it for me. Any way I can help to debug it?
Which version are you running?
Do you have a reliable way to reproduce the segfault?
I am running on master
. I'll try to reproduce it in a test file to share and report back 👍🏻
Experiencing the same issue. In our case, we've set DBConnection :timeout
and Exqlite :busy_timeout
to the same value. Could a potential cause be that the DBConnection execute times out and terminates Exqlite before it has an opportunity to free up memory? Then on the subsequent connection creation, Exqlite attempts to allocate memory on the wrong block of data?
Could a potential cause be that the DBConnection execute times out and terminates Exqlite before it has an opportunity to free up memory?
So as it stands right now with the DirtyNIF implementation, the DBConnection will crash but the query running in the background never actually gets terminated. If you are operating on a large database it is quite possible that your query would continue to run even if the connection timed out.
I have been meaning to find time to swap out the DirtyNIF model for a runtime that can cancel running queries.
Then on the subsequent connection creation, Exqlite attempts to allocate memory on the wrong block of data?
We are using enif_alloc/1
and enif_free/1
under the hood for all sqlite memory allocations. This is so that when you look at the live dashboard or query the VM you can see accurate memory usage by the erlang process. Otherwise the bytes that sqlite uses would not be tracked.
https://github.com/elixir-sqlite/exqlite/blob/2c9fa05b64cab9a5470adeae2e317f1480fbd0ca/c_src/sqlite3_nif.c#L32-L107
It is entirely possible that I encoded a bug there, but I don't think that is the case. The first sizeof(int)
of the bytes allocated are reserved for the block size of memory.
What would be really helpful is if there is a way I can replicate this error repeatedly.
Hey @warmwaffles ,
I have been working with @keatz55 on recreating this issue. here is the test case that you can use to reproduce in your machine. Also what we noticed in our testing is that when we adjust timeout
value to double the value of busy_timeout
. we are not getting any segfaults.
test "Reproduce seg fault" do
path = "./test.db"
{:ok, conn} =
DBConnection.start_link(Connection,
busy_timeout: 50000,
pool_size: 50,
timeout: 1,
database: "./test.db",
journal_mode: :wal
)
query = %Query{statement: "create table foo(id integer, val integer)"}
{:ok, _, _} = DBConnection.execute(conn, query, [])
values = for i <- 1..1000, do: "(#{i}, #{i})"
insert_query = %Query{
statement: "insert into foo(id, val) values #{Enum.join(values, ",")}"
}
Task.async_stream(1..5000, fn _val ->
try do
DBConnection.execute(conn, insert_query, [],
busy_timeout: 50000,
timeout: 1
)
catch
kind, reason ->
IO.inspect("#{inspect(kind)} #{inspect(reason)}", label: "Error")
end
end)
|> Stream.run()
end
Output:
00:46:45.859 [error] Exqlite.Connection (#PID<0.277.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.519.0> timed out because it queued and checked out the connection for longer than 1ms
#PID<0.519.0> was at location:
(exqlite 0.11.4) lib/exqlite/connection.ex:500: Exqlite.Connection.prepare/3
(exqlite 0.11.4) lib/exqlite/connection.ex:194: Exqlite.Connection.handle_execute/4
(db_connection 2.4.2) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.4.2) lib/db_connection.ex:1364: DBConnection.run_execute/5
(db_connection 2.4.2) lib/db_connection.ex:1459: DBConnection.run/6
(db_connection 2.4.2) lib/db_connection.ex:652: DBConnection.execute/4
test/exqlite/integration_test.exs:225: anonymous fn/3 in Exqlite.IntegrationTest."test Reproduce seg fault"/1
(elixir 1.13.4) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
(elixir 1.13.4) lib/task/supervised.ex:34: Task.Supervised.reply/4
(stdlib 4.0.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
zsh: segmentation fault mix test test/exqlite/integration_test.exs:202
Excellent news! I'll take a look at this soon.
The busy_timeout
here
DBConnection.execute(conn, insert_query, [],
busy_timeout: 50000,
timeout: 1
)
Has no effect on the query itself. The busy_timeout
is used for only connecting to the database which runs PRAGMA busy_timeout = 50000
But that timeout
is definitely something happening with the DBConnection pooling.
@pavancse17 and @keatz55 right now if you keep your pool size at 1, this segfault won't occur.
I'm still in the process of trying to figure out where the address boundary error is happening precisely, but I am suspicious that it is around the open and closing of connections.
I'm still looking in to this issue. Will need to probably utilize a custom compiled ERTS with valgrind compiled with it.
I'm still having a difficult time trying to figure what exactly is causing this.
Here is the test that reproduces the issue.
https://github.com/elixir-sqlite/exqlite/compare/segfault-hunting
I believe something funky is happening when a DBConnection
times out. There is no callback to handle that defined upstream to allow for a graceful tear down of resources.
I don't have any concrete evidence yet, but I believe a reference to an sqlite database in memory is getting garbage collected when it shouldn't be, or a statement is getting garbage collected at the wrong time.
I do believe I am going to take some time and implement https://www.erlang.org/doc/man/erl_nif.html#enif_open_resource_type_x which has a ErlNifResourceDown
callback that may be useful for this project to gracefully stop any running queries and do some better clean up.