fix: reduce log spamming when connection goes down
When a socket goes down due to connectivity issues (e.g.: the connection is flaky and is broken), then a big crash report is logged. If this process is supervised and restarted, that may lead to lots of logs being generated.
Here, econnrefused stop reasons are treated as a normal shutdown, so a crash report from gen_server is avoided.
Hey, may you please attach the examples of error logs before and after this change?
I am slightly opposed to this change because
- in epgsql we don't really do much to avoid crashes in the connection process if something is not going according to plan. And to have too many special cases for special errors would make driver's behavior less predictable
- maybe excess logging can be solved by OTP logger configuration? Eg, special filters or handlers or formatters?
- or maybe we can add special handler to gen_server callback
format_statusas an option
But please let's look at how logs change before and after this change.
Hi @seriyps !
in epgsql we don't really do much to avoid crashes in the connection process if something is not going according to plan. And to have too many special cases for special errors would make driver's behavior less predictable
Please note that we are not avoiding crashes here. We're just avoiding crash reports from polluting the logs.
or maybe we can add special handler to gen_server callback format_status as an option
It's not possible to customize this behavior with format_status. This is the default behavior from gen_server when the stop reason is not normal | shutdown | {shutdown, term()}.
Quoting gen_server:Module:terminate callback's docs:
Notice that for any other reason than normal, shutdown, or {shutdown,Term}, see stop/3, the gen_server process is assumed to terminate because of an error, and an error report is issued using logger(3).
See: https://github.com/erlang/otp/blob/412bff5196fc0ab88a61fe37ca30e5226fc7872d/lib/stdlib/src/gen_server.erl#L1345-L1353
Hey, may you please attach the examples of error logs before and after this change?
Sample logs before the change
Sample logs before the change
{"time":"2024-05-09T14:09:23.079766+00:00","level":"error","msg":"Generic server <0.7382587.0> terminating. Reason: sock_closed. Last message: {ssl_closed,{sslsocket,{gen_tcp,#Port<0.2053409>,tls_connection,undefined},[<0.7382604.0>,<0.7382600.0>]}}. State: {state,ssl,undefined,<<>>,{-787269458,854068435},on_message,{codec,#{},[null,undefined],{oid_db,#{16 => {type,16,bool,false,undefined,epgsql_codec_boolean,[]},1041 => {type,1041,inet,true,869,epgsql_codec_net,[]},18 => {type,18,char,false,undefined,epgsql_codec_bpchar,[]},1115 => {type,1115,timestamp,true,1114,epgsql_codec_datetime,epgsql_idatetime},651 => {type,651,cidr,true,650,epgsql_codec_net,[]},114 => {type,114,json,false,undefined,epgsql_codec_json,[]},1022 => {type,1022,float8,true,701,epgsql_codec_float,[]},3912 => {type,3912,daterange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},3909 => {type,3909,tsrange,true,3908,epgsql_codec_timerange,epgsql_idatetime},1184 => {type,1184,timestamptz,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1114 => {type,1114,timestamp,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1270 => {type,1270,timetz,true,1266,epgsql_codec_datetime,epgsql_idatetime},1042 => {type,1042,bpchar,false,undefined,epgsql_codec_bpchar,[]},600 => {type,600,point,false,undefined,epgsql_codec_geometric,[]},3904 => {type,3904,int4range,false,undefined,epgsql_codec_intrange,[]},700 => {type,700,float4,false,undefined,epgsql_codec_float,[]},1185 => {type,1185,timestamptz,true,1184,epgsql_codec_datetime,epgsql_idatetime},1266 => {type,1266,timetz,false,undefined,epgsql_codec_datetime,epgsql_idatetime},650 => {type,650,cidr,false,undefined,epgsql_codec_net,[]},1016 => {type,1016,int8,true,20,epgsql_codec_integer,[]},1021 => {type,1021,float4,true,700,epgsql_codec_float,[]},21 => {type,21,int2,false,undefined,epgsql_codec_integer,[]},1017 => {type,1017,point,true,600,epgsql_codec_geometric,[]},1182 => {type,1182,date,true,1082,epgsql_codec_datetime,epgsql_idatetime},3908 => {type,3908,tsrange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},1005 => {type,1005,int2,true,21,epgsql_codec_integer,[]},17 => {type,17,bytea,false,undefined,epgsql_codec_text,[]},1002 => {type,1002,char,true,18,epgsql_codec_bpchar,[]},3910 => {type,3910,tstzrange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},3927 => {type,3927,int8range,true,3926,epgsql_codec_intrange,[]},3807 => {type,3807,jsonb,true,3802,epgsql_codec_json,[]},1082 => {type,1082,date,false,undefined,epgsql_codec_datetime,epgsql_idatetime},3913 => {type,3913,daterange,true,3912,epgsql_codec_timerange,epgsql_idatetime},1183 => {type,1183,time,true,1083,epgsql_codec_datetime,epgsql_idatetime},20 => {type,20,int8,false,undefined,epgsql_codec_integer,[]},1007 => {type,1007,int4,true,23,epgsql_codec_integer,[]},1083 => {type,1083,time,false,undefined,epgsql_codec_datetime,epgsql_idatetime},2951 => {type,2951,uuid,true,2950,epgsql_codec_uuid,[]},2950 => {type,2950,uuid,false,undefined,epgsql_codec_uuid,[]},3911 => {type,3911,tstzrange,true,3910,epgsql_codec_timerange,epgsql_idatetime},1043 => {type,1043,varchar,false,undefined,epgsql_codec_text,[]},1014 => {type,1014,bpchar,true,1042,epgsql_codec_bpchar,[]},199 => {type,199,json,true,114,epgsql_codec_json,[]},25 => {type,25,text,false,undefined,epgsql_codec_text,[]},1009 => {type,1009,text,true,25,epgsql_codec_text,[]},1000 => {type,1000,bool,true,16,epgsql_codec_boolean,[]},701 => {type,701,float8,false,undefined,epgsql_codec_float,[]},3905 => {type,3905,int4range,true,3904,epgsql_codec_intrange,[]},1015 => {type,1015,varchar,true,1043,epgsql_codec_text,[]},3802 => {type,3802,jsonb,false,undefined,epgsql_codec_json,[]},829 => {type,829,macaddr,false,undefined,epgsql_codec_net,[]},1187 => {type,1187,interval,true,1186,epgsql_codec_datetime,epgsql_idatetime},774 => {type,774,macaddr8,false,undefined,epgsql_codec_net,[]},1186 => {type,1186,interval,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1040 => {type,1040,macaddr,true,829,epgsql_codec_net,[]},3926 => {type,3926,int8range,false,undefined,epgsql_codec_intrange,[]},775 => {type,775,macaddr8,true,774,epgsql_codec_net,[]},23 => {type,23,int4,false,undefined,epgsql_codec_integer,[]},1001 => {type,1001,bytea,true,17,epgsql_codec_text,[]},869 => {type,869,inet,false,undefined,epgsql_codec_net,[]}},#{{bool,false} => 16,{point,true} => 1017,{uuid,true} => 2951,{json,true} => 199,{macaddr8,false} => 774,{char,false} => 18,{timestamptz,true} => 1185,{varchar,false} => 1043,{int4range,false} => 3904,{text,true} => 1009,{tstzrange,true} => 3911,{daterange,true} => 3913,{int4,true} => 1007,{timetz,false} => 1266,{bool,true} => 1000,{jsonb,true} => 3807,{int8,true} => 1016,{timetz,true} => 1270,{macaddr8,true} => 775,{int8range,true} => 3927,{float8,false} => 701,{float4,false} => 700,{inet,false} => 869,{cidr,true} => 651,{daterange,false} => 3912,{tsrange,true} => 3909,{int2,false} => 21,{json,false} => 114,{char,true} => 1002,{tsrange,false} => 3908,{interval,true} => 1187,{float4,true} => 1021,{date,false} => 1082,{tstzrange,false} => 3910,{timestamp,true} => 1115,{timestamptz,false} => 1184,{interval,false} => 1186,{int8range,false} => 3926,{bpchar,false} => 1042,{int8,false} => 20,{varchar,true} => 1015,{bytea,false} => 17,{point,false} => 600,{bpchar,true} => 1014,{int2,true} => 1005,{time,true} => 1183,{macaddr,true} => 1040,{jsonb,false} => 3802,{inet,true} => 1041,{int4,false} => 23,{time,false} => 1083,{text,false} => 25,{float8,true} => 1022,{macaddr,false} => 829,{uuid,false} => 2950,{int4range,true} => 3905,{bytea,true} => 1001,{date,true} => 1182,{cidr,false} => 650,{timestamp,false} => 1114}}},{[],[]},undefined,undefined,undefined,undefined,[{<<\"default_transaction_read_only\">>,<<\"off\">>},{<<\"scram_iterations\">>,<<\"4096\">>},{<<\"client_encoding\">>,<<\"UTF8\">>},{<<\"standard_conforming_strings\">>,<<\"on\">>},{<<\"application_name\">>,<<>>},{<<\"IntervalStyle\">>,<<\"postgres\">>},{<<\"integer_datetimes\">>,<<\"on\">>},{<<\"TimeZone\">>,<<\"GMT\">>},{<<\"DateStyle\">>,<<\"ISO, MDY\">>},{<<\"is_superuser\">>,<<\"off\">>},{<<\"in_hot_standby\">>,<<\"off\">>},{<<\"session_authorization\">>,<<\"neondb_owner\">>},{<<\"server_version\">>,<<\"16.2\">>},{<<\"server_encoding\">>,<<\"UTF8\">>}],information_redacted,[],true,73,undefined,undefined,#{port => 5432,ssl => required,host => \"ep-hidden-butterfly-a56rrawp-pooler.us-east-2.aws.neon.tech\",database => <<\"gnss\">>,ssl_opts => [{verify,verify_none},{server_name_indication,\"ep-hidden-butterfly-a56rrawp-pooler.us-east-2.aws.neon.tech\"},{versions,['tlsv1.3','tlsv1.2']},{ciphers,[\"TLS_AES_256_GCM_SHA384\",\"TLS_AES_128_GCM_SHA256\",\"TLS_CHACHA20_POLY1305_SHA256\",\"TLS_AES_128_CCM_SHA256\",\"TLS_AES_128_CCM_8_SHA256\",\"ECDHE-ECDSA-AES256-GCM-SHA384\",\"ECDHE-RSA-AES256-GCM-SHA384\",\"ECDHE-ECDSA-AES256-SHA384\",\"ECDHE-RSA-AES256-SHA384\",\"ECDH-ECDSA-AES256-GCM-SHA384\",\"ECDH-RSA-AES256-GCM-SHA384\",\"ECDH-ECDSA-AES256-SHA384\",\"ECDH-RSA-AES256-SHA384\",\"DHE-DSS-AES256-GCM-SHA384\",\"DHE-DSS-AES256-SHA256\",\"AES256-GCM-SHA384\",\"AES256-SHA256\",\"ECDHE-ECDSA-AES128-GCM-SHA256\",\"ECDHE-RSA-AES128-GCM-SHA256\",\"ECDHE-ECDSA-AES128-SHA256\",\"ECDHE-RSA-AES128-SHA256\",\"ECDH-ECDSA-AES128-GCM-SHA256\",\"ECDH-RSA-AES128-GCM-SHA256\",\"ECDH-ECDSA-AES128-SHA256\",\"ECDH-RSA-AES128-SHA256\",\"DHE-DSS-AES128-GCM-SHA256\",\"DHE-DSS-AES128-SHA256\",\"AES128-GCM-SHA256\",\"AES128-SHA256\",\"ECDHE-ECDSA-AES256-SHA\",\"ECDHE-RSA-AES256-SHA\",\"DHE-DSS-AES256-SHA\",\"ECDH-ECDSA-AES256-SHA\",\"ECDH-RSA-AES256-SHA\",\"ECDHE-ECDSA-AES128-SHA\",\"ECDHE-RSA-AES128-SHA\",\"DHE-DSS-AES128-SHA\",\"ECDH-ECDSA-AES128-SHA\",\"ECDH-RSA-AES128-SHA\",\"RSA-PSK-AES256-GCM-SHA384\",\"RSA-PSK-AES256-CBC-SHA384\",\"RSA-PSK-AES128-GCM-SHA256\",\"RSA-PSK-AES128-CBC-SHA256\",\"RSA-PSK-AES256-CBC-SHA\",\"RSA-PSK-AES128-CBC-SHA\"]},{reuse_sessions,true},{depth,10},{secure_renegotiate,true}]},#{<<\"action:timescale:action-acaf4559:connector:timescale:connector-90005771\">> => {statement,<<\"action:timescale:action-acaf4559:connector:timescale:connector-90005771\">>,[],[int8,text,text,text,text,text,text,text,text,text,{unknown_oid,1700},jsonb,text],[{20,int8,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{unknown_oid,1700},{3802,jsonb,false},{25,text,false}]}}}.","mfa":"{gen_server,error_info,8}","domain":["otp"],"pid":"<0.7382587.0>","line":1391,"error_logger":{"tag":"error","report_cb":"fun gen_server:format_log/1"}}
Sample logs after the change
No logs: there's no crash report, and the supervised process is just periodically restarted until connection is reestablished.
Note that the problem here is not only that the crash report is huge. That could be indeed solved by an appropriate format_status callback.
The problem is also that this crash reported is printed every single time the connection returns econnrefused. So, if there is some transient network issue that prevents the connection, then those log lines will be printed over and over until the network condition is resolved.
But if there will be no logs, how do we notice (from the Erlang app logs) that the database went down? I just feel that by silently hiding errors like that we make more harm than good because from the app logs it may look like everything is great while in fact the app is broken (due to DB being down)?
Usually, this kind of process lives under supervision. So, if it were to log this huge crash report (or even a single line, for that matter), it would be a bit annoying, and also checking the logs for the app health is perhaps not the best way to monitor it.
Instead, one option that is commonly used is to have some kind of health check that raises an alarm once if they fail, which is what we do in EMQX.
Also, note that the changes herein are very specific to the econnrefused stop reason: other kinds of reasons to stop the socket will still fail loudly. :wink:
Hey, sorry for delay. I have a proposal: can we make this behavior configurable? i.e. if there is a connect option, say, silent_crash (or whatever better describes this behavior), then we stop the process with reason shutdown or normal. But by default we leave the old behavior. Maybe we can make it not boolean, but a list of common shutdown scenarios where we want it to not spam the logs, eg, silent_crash => [socket_closed, econnrefused, ssl_error, ....]
Hi, I'll try to push the changes during the week, thanks for the feedback.
Hi @seriyps , have you had the opportunity to check if the current state is more acceptable?