epgsql icon indicating copy to clipboard operation
epgsql copied to clipboard

fix: reduce log spamming when connection goes down

Open thalesmg opened this issue 1 year ago • 9 comments

When a socket goes down due to connectivity issues (e.g.: the connection is flaky and is broken), then a big crash report is logged. If this process is supervised and restarted, that may lead to lots of logs being generated.

Here, econnrefused stop reasons are treated as a normal shutdown, so a crash report from gen_server is avoided.

thalesmg avatar May 20 '24 14:05 thalesmg

Hey, may you please attach the examples of error logs before and after this change?

seriyps avatar May 21 '24 15:05 seriyps

I am slightly opposed to this change because

  • in epgsql we don't really do much to avoid crashes in the connection process if something is not going according to plan. And to have too many special cases for special errors would make driver's behavior less predictable
  • maybe excess logging can be solved by OTP logger configuration? Eg, special filters or handlers or formatters?
  • or maybe we can add special handler to gen_server callback format_status as an option

But please let's look at how logs change before and after this change.

seriyps avatar May 21 '24 15:05 seriyps

Hi @seriyps !

in epgsql we don't really do much to avoid crashes in the connection process if something is not going according to plan. And to have too many special cases for special errors would make driver's behavior less predictable

Please note that we are not avoiding crashes here. We're just avoiding crash reports from polluting the logs.

or maybe we can add special handler to gen_server callback format_status as an option

It's not possible to customize this behavior with format_status. This is the default behavior from gen_server when the stop reason is not normal | shutdown | {shutdown, term()}.

Quoting gen_server:Module:terminate callback's docs:

Notice that for any other reason than normal, shutdown, or {shutdown,Term}, see stop/3, the gen_server process is assumed to terminate because of an error, and an error report is issued using logger(3).

See: https://github.com/erlang/otp/blob/412bff5196fc0ab88a61fe37ca30e5226fc7872d/lib/stdlib/src/gen_server.erl#L1345-L1353

Hey, may you please attach the examples of error logs before and after this change?

Sample logs before the change

Sample logs before the change
{"time":"2024-05-09T14:09:23.079766+00:00","level":"error","msg":"Generic server <0.7382587.0> terminating. Reason: sock_closed. Last message: {ssl_closed,{sslsocket,{gen_tcp,#Port<0.2053409>,tls_connection,undefined},[<0.7382604.0>,<0.7382600.0>]}}. State: {state,ssl,undefined,<<>>,{-787269458,854068435},on_message,{codec,#{},[null,undefined],{oid_db,#{16 => {type,16,bool,false,undefined,epgsql_codec_boolean,[]},1041 => {type,1041,inet,true,869,epgsql_codec_net,[]},18 => {type,18,char,false,undefined,epgsql_codec_bpchar,[]},1115 => {type,1115,timestamp,true,1114,epgsql_codec_datetime,epgsql_idatetime},651 => {type,651,cidr,true,650,epgsql_codec_net,[]},114 => {type,114,json,false,undefined,epgsql_codec_json,[]},1022 => {type,1022,float8,true,701,epgsql_codec_float,[]},3912 => {type,3912,daterange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},3909 => {type,3909,tsrange,true,3908,epgsql_codec_timerange,epgsql_idatetime},1184 => {type,1184,timestamptz,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1114 => {type,1114,timestamp,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1270 => {type,1270,timetz,true,1266,epgsql_codec_datetime,epgsql_idatetime},1042 => {type,1042,bpchar,false,undefined,epgsql_codec_bpchar,[]},600 => {type,600,point,false,undefined,epgsql_codec_geometric,[]},3904 => {type,3904,int4range,false,undefined,epgsql_codec_intrange,[]},700 => {type,700,float4,false,undefined,epgsql_codec_float,[]},1185 => {type,1185,timestamptz,true,1184,epgsql_codec_datetime,epgsql_idatetime},1266 => {type,1266,timetz,false,undefined,epgsql_codec_datetime,epgsql_idatetime},650 => {type,650,cidr,false,undefined,epgsql_codec_net,[]},1016 => {type,1016,int8,true,20,epgsql_codec_integer,[]},1021 => {type,1021,float4,true,700,epgsql_codec_float,[]},21 => {type,21,int2,false,undefined,epgsql_codec_integer,[]},1017 => {type,1017,point,true,600,epgsql_codec_geometric,[]},1182 => {type,1182,date,true,1082,epgsql_codec_datetime,epgsql_idatetime},3908 => {type,3908,tsrange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},1005 => {type,1005,int2,true,21,epgsql_codec_integer,[]},17 => {type,17,bytea,false,undefined,epgsql_codec_text,[]},1002 => {type,1002,char,true,18,epgsql_codec_bpchar,[]},3910 => {type,3910,tstzrange,false,undefined,epgsql_codec_timerange,epgsql_idatetime},3927 => {type,3927,int8range,true,3926,epgsql_codec_intrange,[]},3807 => {type,3807,jsonb,true,3802,epgsql_codec_json,[]},1082 => {type,1082,date,false,undefined,epgsql_codec_datetime,epgsql_idatetime},3913 => {type,3913,daterange,true,3912,epgsql_codec_timerange,epgsql_idatetime},1183 => {type,1183,time,true,1083,epgsql_codec_datetime,epgsql_idatetime},20 => {type,20,int8,false,undefined,epgsql_codec_integer,[]},1007 => {type,1007,int4,true,23,epgsql_codec_integer,[]},1083 => {type,1083,time,false,undefined,epgsql_codec_datetime,epgsql_idatetime},2951 => {type,2951,uuid,true,2950,epgsql_codec_uuid,[]},2950 => {type,2950,uuid,false,undefined,epgsql_codec_uuid,[]},3911 => {type,3911,tstzrange,true,3910,epgsql_codec_timerange,epgsql_idatetime},1043 => {type,1043,varchar,false,undefined,epgsql_codec_text,[]},1014 => {type,1014,bpchar,true,1042,epgsql_codec_bpchar,[]},199 => {type,199,json,true,114,epgsql_codec_json,[]},25 => {type,25,text,false,undefined,epgsql_codec_text,[]},1009 => {type,1009,text,true,25,epgsql_codec_text,[]},1000 => {type,1000,bool,true,16,epgsql_codec_boolean,[]},701 => {type,701,float8,false,undefined,epgsql_codec_float,[]},3905 => {type,3905,int4range,true,3904,epgsql_codec_intrange,[]},1015 => {type,1015,varchar,true,1043,epgsql_codec_text,[]},3802 => {type,3802,jsonb,false,undefined,epgsql_codec_json,[]},829 => {type,829,macaddr,false,undefined,epgsql_codec_net,[]},1187 => {type,1187,interval,true,1186,epgsql_codec_datetime,epgsql_idatetime},774 => {type,774,macaddr8,false,undefined,epgsql_codec_net,[]},1186 => {type,1186,interval,false,undefined,epgsql_codec_datetime,epgsql_idatetime},1040 => {type,1040,macaddr,true,829,epgsql_codec_net,[]},3926 => {type,3926,int8range,false,undefined,epgsql_codec_intrange,[]},775 => {type,775,macaddr8,true,774,epgsql_codec_net,[]},23 => {type,23,int4,false,undefined,epgsql_codec_integer,[]},1001 => {type,1001,bytea,true,17,epgsql_codec_text,[]},869 => {type,869,inet,false,undefined,epgsql_codec_net,[]}},#{{bool,false} => 16,{point,true} => 1017,{uuid,true} => 2951,{json,true} => 199,{macaddr8,false} => 774,{char,false} => 18,{timestamptz,true} => 1185,{varchar,false} => 1043,{int4range,false} => 3904,{text,true} => 1009,{tstzrange,true} => 3911,{daterange,true} => 3913,{int4,true} => 1007,{timetz,false} => 1266,{bool,true} => 1000,{jsonb,true} => 3807,{int8,true} => 1016,{timetz,true} => 1270,{macaddr8,true} => 775,{int8range,true} => 3927,{float8,false} => 701,{float4,false} => 700,{inet,false} => 869,{cidr,true} => 651,{daterange,false} => 3912,{tsrange,true} => 3909,{int2,false} => 21,{json,false} => 114,{char,true} => 1002,{tsrange,false} => 3908,{interval,true} => 1187,{float4,true} => 1021,{date,false} => 1082,{tstzrange,false} => 3910,{timestamp,true} => 1115,{timestamptz,false} => 1184,{interval,false} => 1186,{int8range,false} => 3926,{bpchar,false} => 1042,{int8,false} => 20,{varchar,true} => 1015,{bytea,false} => 17,{point,false} => 600,{bpchar,true} => 1014,{int2,true} => 1005,{time,true} => 1183,{macaddr,true} => 1040,{jsonb,false} => 3802,{inet,true} => 1041,{int4,false} => 23,{time,false} => 1083,{text,false} => 25,{float8,true} => 1022,{macaddr,false} => 829,{uuid,false} => 2950,{int4range,true} => 3905,{bytea,true} => 1001,{date,true} => 1182,{cidr,false} => 650,{timestamp,false} => 1114}}},{[],[]},undefined,undefined,undefined,undefined,[{<<\"default_transaction_read_only\">>,<<\"off\">>},{<<\"scram_iterations\">>,<<\"4096\">>},{<<\"client_encoding\">>,<<\"UTF8\">>},{<<\"standard_conforming_strings\">>,<<\"on\">>},{<<\"application_name\">>,<<>>},{<<\"IntervalStyle\">>,<<\"postgres\">>},{<<\"integer_datetimes\">>,<<\"on\">>},{<<\"TimeZone\">>,<<\"GMT\">>},{<<\"DateStyle\">>,<<\"ISO, MDY\">>},{<<\"is_superuser\">>,<<\"off\">>},{<<\"in_hot_standby\">>,<<\"off\">>},{<<\"session_authorization\">>,<<\"neondb_owner\">>},{<<\"server_version\">>,<<\"16.2\">>},{<<\"server_encoding\">>,<<\"UTF8\">>}],information_redacted,[],true,73,undefined,undefined,#{port => 5432,ssl => required,host => \"ep-hidden-butterfly-a56rrawp-pooler.us-east-2.aws.neon.tech\",database => <<\"gnss\">>,ssl_opts => [{verify,verify_none},{server_name_indication,\"ep-hidden-butterfly-a56rrawp-pooler.us-east-2.aws.neon.tech\"},{versions,['tlsv1.3','tlsv1.2']},{ciphers,[\"TLS_AES_256_GCM_SHA384\",\"TLS_AES_128_GCM_SHA256\",\"TLS_CHACHA20_POLY1305_SHA256\",\"TLS_AES_128_CCM_SHA256\",\"TLS_AES_128_CCM_8_SHA256\",\"ECDHE-ECDSA-AES256-GCM-SHA384\",\"ECDHE-RSA-AES256-GCM-SHA384\",\"ECDHE-ECDSA-AES256-SHA384\",\"ECDHE-RSA-AES256-SHA384\",\"ECDH-ECDSA-AES256-GCM-SHA384\",\"ECDH-RSA-AES256-GCM-SHA384\",\"ECDH-ECDSA-AES256-SHA384\",\"ECDH-RSA-AES256-SHA384\",\"DHE-DSS-AES256-GCM-SHA384\",\"DHE-DSS-AES256-SHA256\",\"AES256-GCM-SHA384\",\"AES256-SHA256\",\"ECDHE-ECDSA-AES128-GCM-SHA256\",\"ECDHE-RSA-AES128-GCM-SHA256\",\"ECDHE-ECDSA-AES128-SHA256\",\"ECDHE-RSA-AES128-SHA256\",\"ECDH-ECDSA-AES128-GCM-SHA256\",\"ECDH-RSA-AES128-GCM-SHA256\",\"ECDH-ECDSA-AES128-SHA256\",\"ECDH-RSA-AES128-SHA256\",\"DHE-DSS-AES128-GCM-SHA256\",\"DHE-DSS-AES128-SHA256\",\"AES128-GCM-SHA256\",\"AES128-SHA256\",\"ECDHE-ECDSA-AES256-SHA\",\"ECDHE-RSA-AES256-SHA\",\"DHE-DSS-AES256-SHA\",\"ECDH-ECDSA-AES256-SHA\",\"ECDH-RSA-AES256-SHA\",\"ECDHE-ECDSA-AES128-SHA\",\"ECDHE-RSA-AES128-SHA\",\"DHE-DSS-AES128-SHA\",\"ECDH-ECDSA-AES128-SHA\",\"ECDH-RSA-AES128-SHA\",\"RSA-PSK-AES256-GCM-SHA384\",\"RSA-PSK-AES256-CBC-SHA384\",\"RSA-PSK-AES128-GCM-SHA256\",\"RSA-PSK-AES128-CBC-SHA256\",\"RSA-PSK-AES256-CBC-SHA\",\"RSA-PSK-AES128-CBC-SHA\"]},{reuse_sessions,true},{depth,10},{secure_renegotiate,true}]},#{<<\"action:timescale:action-acaf4559:connector:timescale:connector-90005771\">> => {statement,<<\"action:timescale:action-acaf4559:connector:timescale:connector-90005771\">>,[],[int8,text,text,text,text,text,text,text,text,text,{unknown_oid,1700},jsonb,text],[{20,int8,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{25,text,false},{unknown_oid,1700},{3802,jsonb,false},{25,text,false}]}}}.","mfa":"{gen_server,error_info,8}","domain":["otp"],"pid":"<0.7382587.0>","line":1391,"error_logger":{"tag":"error","report_cb":"fun gen_server:format_log/1"}}

Sample logs after the change

No logs: there's no crash report, and the supervised process is just periodically restarted until connection is reestablished.

thalesmg avatar May 21 '24 16:05 thalesmg

Note that the problem here is not only that the crash report is huge. That could be indeed solved by an appropriate format_status callback.

The problem is also that this crash reported is printed every single time the connection returns econnrefused. So, if there is some transient network issue that prevents the connection, then those log lines will be printed over and over until the network condition is resolved.

thalesmg avatar May 21 '24 16:05 thalesmg

But if there will be no logs, how do we notice (from the Erlang app logs) that the database went down? I just feel that by silently hiding errors like that we make more harm than good because from the app logs it may look like everything is great while in fact the app is broken (due to DB being down)?

seriyps avatar Jul 01 '24 08:07 seriyps

Usually, this kind of process lives under supervision. So, if it were to log this huge crash report (or even a single line, for that matter), it would be a bit annoying, and also checking the logs for the app health is perhaps not the best way to monitor it.

Instead, one option that is commonly used is to have some kind of health check that raises an alarm once if they fail, which is what we do in EMQX.

Also, note that the changes herein are very specific to the econnrefused stop reason: other kinds of reasons to stop the socket will still fail loudly. :wink:

thalesmg avatar Jul 01 '24 17:07 thalesmg

Hey, sorry for delay. I have a proposal: can we make this behavior configurable? i.e. if there is a connect option, say, silent_crash (or whatever better describes this behavior), then we stop the process with reason shutdown or normal. But by default we leave the old behavior. Maybe we can make it not boolean, but a list of common shutdown scenarios where we want it to not spam the logs, eg, silent_crash => [socket_closed, econnrefused, ssl_error, ....]

seriyps avatar Oct 31 '24 10:10 seriyps

Hi, I'll try to push the changes during the week, thanks for the feedback.

thalesmg avatar Nov 04 '24 18:11 thalesmg

Hi @seriyps , have you had the opportunity to check if the current state is more acceptable?

thalesmg avatar Oct 01 '25 17:10 thalesmg