
CouchDB crashes in production

Open idea-christian opened this issue 6 months ago • 9 comments

Description

Our CouchDB server keeps crashing every few hours; the log is more or less always the same, as shown below. I do understand that we are running into a system limit (and the system usage is very high), but I want to understand what exactly the issue is.

I have already adjusted the [query_server_config] os_process_limit (2000) and os_process_soft_limit (200), but I guess our server is simply not keeping up with the load?

[error] 2025-05-20T05:39:47.225573Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.225613Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.225741Z [email protected] emulator -------- Error in process <0.74769704.0> on node '[email protected]' with exit value:
{system_limit,[{erlang,spawn_link,[erlang,apply,[#Fun<couch_mrview_updater.0.48133257>,[]]],[{error_info,#{module => erl_erts_errors}}]},{erlang,spawn_link,1,[]},{couch_mrview_updater,start_update,4,[{file,"src/couch_mrview_updater.erl"},{line,67}]},{couch_index_updater,'-update/3-fun-4-',8,[{file,"src/couch_index_updater.erl"},{line,173}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}

[error] 2025-05-20T05:39:47.225809Z [email protected] emulator -------- Error in process <0.74769704.0> on node '[email protected]' with exit value:
{system_limit,[{erlang,spawn_link,[erlang,apply,[#Fun<couch_mrview_updater.0.48133257>,[]]],[{error_info,#{module => erl_erts_errors}}]},{erlang,spawn_link,1,[]},{couch_mrview_updater,start_update,4,[{file,"src/couch_mrview_updater.erl"},{line,67}]},{couch_index_updater,'-update/3-fun-4-',8,[{file,"src/couch_index_updater.erl"},{line,173}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}

[error] 2025-05-20T05:39:47.229093Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229115Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229125Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229134Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229142Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229161Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.229305Z [email protected] emulator -------- Error in process <0.73767013.0> on node '[email protected]' with exit value:
{system_limit,[{erlang,spawn_opt,[proc_lib,init_p,[<0.73767013.0>,[],gen,init_it,[gen_server,<0.73767013.0>,<0.73767013.0>,couch_work_queue,[{max_size,100000},{max_items,500}],[]]],[link,monitor]],[{error_info,#{module => erl_erts_errors}}]},{proc_lib,spawn_opt,4,[{file,"proc_lib.erl"},{line,192}]},{proc_lib,start_link,5,[{file,"proc_lib.erl"},{line,358}]},{couch_mrview_updater,start_update,4,[{file,"src/couch_mrview_updater.erl"},{line,26}]},{couch_index_updater,'-update/3-fun-4-',8,[{file,"src/couch_index_updater.erl"},{line,173}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}

[error] 2025-05-20T05:39:47.229453Z [email protected] emulator -------- Error in process <0.73767013.0> on node '[email protected]' with exit value:
{system_limit,[{erlang,spawn_opt,[proc_lib,init_p,[<0.73767013.0>,[],gen,init_it,[gen_server,<0.73767013.0>,<0.73767013.0>,couch_work_queue,[{max_size,100000},{max_items,500}],[]]],[link,monitor]],[{error_info,#{module => erl_erts_errors}}]},{proc_lib,spawn_opt,4,[{file,"proc_lib.erl"},{line,192}]},{proc_lib,start_link,5,[{file,"proc_lib.erl"},{line,358}]},{couch_mrview_updater,start_update,4,[{file,"src/couch_mrview_updater.erl"},{line,26}]},{couch_index_updater,'-update/3-fun-4-',8,[{file,"src/couch_index_updater.erl"},{line,173}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}

[error] 2025-05-20T05:39:47.229895Z [email protected] <0.75491598.0> -------- CRASH REPORT Process  (<0.75491598.0>) with 1 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= proc_lib:spawn_opt/4(line:192) <= proc_lib:start_link/5(line:358) <= couch_file:open/2(line:72) <= couch_mrview_util:open_file/1(line:860) <= couch_mrview_index:open/2(line:123) <= couch_index:'-init/1-fun-0-'/3(line:76) <= couch_util:with_db/2(line:559); initial_call: {couch_index,init,['Argument__1']}, ancestors: [<0.75129833.0>], message_queue_len: 0, links: [<0.75129833.0>], dictionary: [{io_priority,{view_update,<<"shards/00000000-7fffffff/vessel_0945.17309...">>}}], trap_exit: false, status: running, heap_size: 1598, stack_size: 28, reductions: 1842
[error] 2025-05-20T05:39:47.230238Z [email protected] <0.75491598.0> -------- CRASH REPORT Process  (<0.75491598.0>) with 1 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= proc_lib:spawn_opt/4(line:192) <= proc_lib:start_link/5(line:358) <= couch_file:open/2(line:72) <= couch_mrview_util:open_file/1(line:860) <= couch_mrview_index:open/2(line:123) <= couch_index:'-init/1-fun-0-'/3(line:76) <= couch_util:with_db/2(line:559); initial_call: {couch_index,init,['Argument__1']}, ancestors: [<0.75129833.0>], message_queue_len: 0, links: [<0.75129833.0>], dictionary: [{io_priority,{view_update,<<"shards/00000000-7fffffff/vessel_0945.17309...">>}}], trap_exit: false, status: running, heap_size: 1598, stack_size: 28, reductions: 1842
[error] 2025-05-20T05:39:47.230551Z [email protected] emulator -------- Error in process <0.69857375.0> on node '[email protected]' with exit value:
{{badmatch,{system_limit,[{erlang,spawn_opt,[proc_lib,init_p,[<0.75491598.0>,[<0.75129833.0>],gen,init_it,[gen_server,<0.75491598.0>,<0.75491598.0>,couch_file,{"./data/.shards/00000000-7fffffff/vessel_0945.1730977511_design/mrview/d1e62a1b1e90e57aa2bddcd0066e1db0.view",[nologifmissing],<0.75491598.0>,#Ref<0.934213240.2265710595.49597>},[]]],[link,monitor]],[{error_info,#{module => erl_erts_errors}}]},{proc_lib,spawn_opt,4,[{file,"proc_lib.erl"},{line,192}]},{proc_lib,start_link,5,[{file,"proc_lib.erl"},{line,358}]},{couch_file,open,2,[{file,"src/couch_file.erl"},{line,72}]},{couch_mrview_util,open_file,1,[{file,"src/couch_mrview_util.erl"},{line,860}]},{couch_mrview_index,open,2,[{file,"src/couch_mrview_index.erl"},{line,123}]},{couch_index,'-init/1-fun-0-',3,[{file,"src/couch_index.erl"},{line,76}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}},[{ken_server,update_ddoc_views,4,[{file,"src/ken_server.erl"},{line,404}]},{ken_server,update_ddoc_indexes,3,[{file,"src/ken_server.erl"},{line,318}]},{ken_server,'-update_db_indexes/2-fun-1-',4,[{file,"src/ken_server.erl"},{line,276}]},{lists,foldl_1,3,[{file,"lists.erl"},{line,1599}]},{ken_server,update_db_indexes,2,[{file,"src/ken_server.erl"},{line,273}]}]}

[error] 2025-05-20T05:39:47.230857Z [email protected] emulator -------- Error in process <0.69857375.0> on node '[email protected]' with exit value:
{{badmatch,{system_limit,[{erlang,spawn_opt,[proc_lib,init_p,[<0.75491598.0>,[<0.75129833.0>],gen,init_it,[gen_server,<0.75491598.0>,<0.75491598.0>,couch_file,{"./data/.shards/00000000-7fffffff/vessel_0945.1730977511_design/mrview/d1e62a1b1e90e57aa2bddcd0066e1db0.view",[nologifmissing],<0.75491598.0>,#Ref<0.934213240.2265710595.49597>},[]]],[link,monitor]],[{error_info,#{module => erl_erts_errors}}]},{proc_lib,spawn_opt,4,[{file,"proc_lib.erl"},{line,192}]},{proc_lib,start_link,5,[{file,"proc_lib.erl"},{line,358}]},{couch_file,open,2,[{file,"src/couch_file.erl"},{line,72}]},{couch_mrview_util,open_file,1,[{file,"src/couch_mrview_util.erl"},{line,860}]},{couch_mrview_index,open,2,[{file,"src/couch_mrview_index.erl"},{line,123}]},{couch_index,'-init/1-fun-0-',3,[{file,"src/couch_index.erl"},{line,76}]},{couch_util,with_db,2,[{file,"src/couch_util.erl"},{line,559}]}]}},[{ken_server,update_ddoc_views,4,[{file,"src/ken_server.erl"},{line,404}]},{ken_server,update_ddoc_indexes,3,[{file,"src/ken_server.erl"},{line,318}]},{ken_server,'-update_db_indexes/2-fun-1-',4,[{file,"src/ken_server.erl"},{line,276}]},{lists,foldl_1,3,[{file,"lists.erl"},{line,1599}]},{ken_server,update_db_indexes,2,[{file,"src/ken_server.erl"},{line,273}]}]}

[error] 2025-05-20T05:39:47.230888Z [email protected] emulator -------- Too many processes
[error] 2025-05-20T05:39:47.230902Z [email protected] emulator -------- Too many processes
[warning] 2025-05-20T05:39:47.233668Z [email protected] <0.368.0> -------- mem3_distribution : node [email protected] down, reason: net_kernel_terminated
[error] 2025-05-20T05:39:47.233799Z [email protected] <0.38.0> -------- gen_server net_kernel terminated with reason: system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241)
  last msg: redacted
     state: {state,'[email protected]',longnames,{tick,<0.40.0>,60000,4},7000,sys_dist,#{},#{},#{},[{listen,#Port<0.4>,<0.39.0>,{net_address,{{127,0,0,1},42261},"127.0.0.1",tcp,inet},inet_tcp_dist}],[],0,#{},net_sup,#{}}
    extra: []
[error] 2025-05-20T05:39:47.234138Z [email protected] <0.38.0> -------- gen_server net_kernel terminated with reason: system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241)
  last msg: redacted
     state: {state,'[email protected]',longnames,{tick,<0.40.0>,60000,4},7000,sys_dist,#{},#{},#{},[{listen,#Port<0.4>,<0.39.0>,{net_address,{{127,0,0,1},42261},"127.0.0.1",tcp,inet},inet_tcp_dist}],[],0,#{},net_sup,#{}}
    extra: []
[error] 2025-05-20T05:39:47.234719Z [email protected] <0.349.0> -------- gen_server '[email protected]' terminated with reason: system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241)
  last msg: redacted
     state: {st,#Ref<0.934213240.2211053577.41460>,#Ref<0.934213240.2211053577.41461>,{[],[]},0,0}
    extra: []
[error] 2025-05-20T05:39:47.235065Z [email protected] <0.349.0> -------- gen_server '[email protected]' terminated with reason: system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241)
  last msg: redacted
     state: {st,#Ref<0.934213240.2211053577.41460>,#Ref<0.934213240.2211053577.41461>,{[],[]},0,0}
    extra: []
[error] 2025-05-20T05:39:47.235442Z [email protected] <0.38.0> -------- CRASH REPORT Process net_kernel (<0.38.0>) with 1 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241); initial_call: {net_kernel,init,['Argument__1']}, ancestors: [net_sup,kernel_sup,<0.19.0>], message_queue_len: 1, links: [<0.40.0>,<0.35.0>], dictionary: [{longnames,true}], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 25290927
[error] 2025-05-20T05:39:47.235678Z [email protected] <0.38.0> -------- CRASH REPORT Process net_kernel (<0.38.0>) with 1 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241); initial_call: {net_kernel,init,['Argument__1']}, ancestors: [net_sup,kernel_sup,<0.19.0>], message_queue_len: 1, links: [<0.40.0>,<0.35.0>], dictionary: [{longnames,true}], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 25290927
[error] 2025-05-20T05:39:47.235917Z [email protected] <0.349.0> -------- CRASH REPORT Process [email protected] (<0.349.0>) with 0 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241); initial_call: {rexi_server,init,['Argument__1']}, ancestors: [rexi_server_sup,rexi_sup,<0.343.0>], message_queue_len: 67, links: [<0.346.0>], dictionary: [], trap_exit: false, status: running, heap_size: 4185, stack_size: 28, reductions: 332961192
[error] 2025-05-20T05:39:47.236087Z [email protected] <0.349.0> -------- CRASH REPORT Process [email protected] (<0.349.0>) with 0 neighbors crashed with reason: system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241); initial_call: {rexi_server,init,['Argument__1']}, ancestors: [rexi_server_sup,rexi_sup,<0.343.0>], message_queue_len: 67, links: [<0.346.0>], dictionary: [], trap_exit: false, status: running, heap_size: 4185, stack_size: 28, reductions: 332961192
[error] 2025-05-20T05:39:47.236238Z [email protected] <0.346.0> -------- Supervisor rexi_server_sup had child '[email protected]' started with rexi_server:start_link('[email protected]') at <0.349.0> exit with reason system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241) in context child_terminated
[error] 2025-05-20T05:39:47.236332Z [email protected] <0.346.0> -------- Supervisor rexi_server_sup had child '[email protected]' started with rexi_server:start_link('[email protected]') at <0.349.0> exit with reason system_limit at erlang:spawn_opt/4 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:73) <= gen_server:try_handle_cast/3(line:1121) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241) in context child_terminated
[error] 2025-05-20T05:39:47.236476Z [email protected] <0.35.0> -------- Supervisor net_sup had child net_kernel started with net_kernel:start_link(#{clean_halt => true,name => '[email protected]',name_domain => longnames,supervisor => net_sup}) at <0.38.0> exit with reason system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241) in context child_terminated
[error] 2025-05-20T05:39:47.236638Z [email protected] <0.35.0> -------- Supervisor net_sup had child net_kernel started with net_kernel:start_link(#{clean_halt => true,name => '[email protected]',name_domain => longnames,supervisor => net_sup}) at <0.38.0> exit with reason system_limit at erlang:spawn_opt/4 <= inet_tcp_dist:gen_setup/6(line:411) <= net_kernel:setup/5(line:1811) <= net_kernel:do_auto_connect_2/5(line:668) <= net_kernel:handle_info/2(line:974) <= gen_server:try_handle_info/3(line:1095) <= gen_server:handle_msg/6(line:1183) <= proc_lib:init_p_do_apply/3(line:241) in context child_terminated
[error] 2025-05-20T05:39:47.236709Z [email protected] <0.35.0> -------- Supervisor net_sup had child net_kernel started with net_kernel:start_link(#{clean_halt => true,name => '[email protected]',name_domain => longnames,supervisor => net_sup}) at <0.38.0> exit with reason reached_max_restart_intensity in context shutdown
[error] 2025-05-20T05:39:47.236789Z [email protected] <0.35.0> -------- Supervisor net_sup had child net_kernel started with net_kernel:start_link(#{clean_halt => true,name => '[email protected]',name_domain => longnames,supervisor => net_sup}) at <0.38.0> exit with reason reached_max_restart_intensity in context shutdown
[error] 2025-05-20T05:39:47.236829Z [email protected] <0.22.0> -------- Supervisor kernel_sup had child net_sup started with erl_distribution:start_link() at <0.35.0> exit with reason shutdown in context child_terminated
[error] 2025-05-20T05:39:47.236863Z [email protected] <0.22.0> -------- Supervisor kernel_sup had child net_sup started with erl_distribution:start_link() at <0.35.0> exit with reason shutdown in context child_terminated
[error] 2025-05-20T05:39:47.236946Z [email protected] <0.22.0> -------- Supervisor kernel_sup had child net_sup started with erl_distribution:start_link() at <0.35.0> exit with reason reached_max_restart_intensity in context shutdown
[error] 2025-05-20T05:39:47.236982Z [email protected] <0.22.0> -------- Supervisor kernel_sup had child net_sup started with erl_distribution:start_link() at <0.35.0> exit with reason reached_max_restart_intensity in context shutdown
[error] 2025-05-20T05:39:47.240982Z [email protected] <0.68837790.0> 059726d8e3 req_err(3318763005) internal_server_error : No DB shards could be opened.
    [<<"fabric_util:get_shard/4 L133">>,<<"fabric:get_security/2 L217">>,<<"chttpd_auth_request:db_authorization_check/1 L109">>,<<"chttpd_auth_request:authorize_request/1 L19">>,<<"chttpd:handle_req_after_auth/2 L428">>,<<"chttpd:process_request/1 L410">>,<<"chttpd:handle_request_int/1 L345">>,<<"mochiweb_http:headers/6 L140">>]
[error] 2025-05-20T05:39:47.253594Z [email protected] <0.70309110.0> 2163077f12 req_err(3318763005) internal_server_error : No DB shards could be opened.
    [<<"fabric_util:get_shard/4 L133">>,<<"fabric:get_security/2 L217">>,<<"chttpd_auth_request:db_authorization_check/1 L109">>,<<"chttpd_auth_request:authorize_request/1 L19">>,<<"chttpd:handle_req_after_auth/2 L428">>,<<"chttpd:process_request/1 L410">>,<<"chttpd:handle_request_int/1 L345">>,<<"mochiweb_http:headers/6 L140">>]

Steps to Reproduce

Expected Behaviour

Your Environment

  • CouchDB version used: 3.5.0
  • Browser name and version: NA
  • Operating system and version: Debian 12

Additional Context

The server is a VM with 12 cores and 62 GB of RAM.

idea-christian avatar May 20 '25 13:05 idea-christian

@idea-christian "Too many processes" indicates you've hit the default Erlang process limit. If you have the memory and CPU capacity (and it looks like you do), you can increase the limit.

The default is a bit low at 131072 (though recent Erlang releases have bumped it up to about 1M).

You can increase the limit in the vm.args file by adding this to it: +P1000000

The limit will be the next power of 2 higher than that:

1> erlang:system_info(process_limit).
1048576

In production with larger clusters we use an even higher limit.
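
For reference, a rough way to confirm the change took effect after a restart is the _node _system endpoint, which reports both the limit and the current process count; admin:password below is a placeholder for your admin credentials:

# After adding the +P flag to vm.args and restarting CouchDB, check the
# Erlang VM's process limit and the live process count:
curl -s http://admin:password@127.0.0.1:5984/_node/_local/_system | grep -Eo '"process_(count|limit)":[0-9]+'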

nickva avatar May 20 '25 15:05 nickva

Thanks @nickva, I will look into that and update the case here later.

We actually talked about setting up a cluster to distribute the load (or setting up multiple single instances). Our setup is maybe a bit special, as we have multiple small-to-medium DBs (from a few MB to <10 GB). They are not related to each other, and each is always replicated with a remote server.

Would it be possible to upgrade a single node to a cluster? Would it actually make sense in our scenario to use a cluster?

PS: I can't thank you and everyone in the project enough for CouchDB, it's awesome to work with 👍

idea-christian avatar May 20 '25 15:05 idea-christian

> Thanks @nickva, I will look into that and update the case here later.

We discussed this a bit in our dev channel and decided to raise the default limit in vm.args to 1M as well: https://github.com/apache/couchdb/pull/5545

> We actually talked about setting up a cluster to distribute the load (or setting up multiple single instances). Our setup is maybe a bit special, as we have multiple small-to-medium DBs (from a few MB to <10 GB). They are not related to each other, and each is always replicated with a remote server.

Either setup would work. As long as all the DBs can fit on one instance, I can see multiple single instances being simpler in one respect. A cluster would be needed once the DBs can no longer all fit on one server.

> Would it be possible to upgrade a single node to a cluster? Would it actually make sense in our scenario to use a cluster?

That's possible but might be a bit tricky; you'd have to carefully manage shard and node names. I haven't done that recently, so I can't give any good advice. If possible, just replicating to a new cluster could be simpler: replicate all DBs to the new cluster initially, stop the traffic to the old instance, replicate again to catch everything up, then switch the traffic over. It does imply some downtime during the catch-up and switchover.
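
To illustrate, here is a minimal sketch of that per-database catch-up replication through the _replicator database; the db name, host names, and credentials below are placeholders:

# Placeholder names/credentials; repeat (or script) this for each database.
# create_target creates the database on the new cluster; continuous keeps it
# caught up until you are ready to switch traffic over.
curl -s -X POST http://admin:password@old-node:5984/_replicator \
  -H 'Content-Type: application/json' \
  -d '{
        "_id": "migrate-vessel_0945",
        "source": "http://admin:password@old-node:5984/vessel_0945",
        "target": "http://admin:password@new-cluster:5984/vessel_0945",
        "create_target": true,
        "continuous": true
      }'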

> PS: I can't thank you and everyone in the project enough for CouchDB, it's awesome to work with 👍

No problem at all! Thanks for using CouchDB and for reaching out. It's much appreciated and it helps us find and fix issues!

nickva avatar May 20 '25 20:05 nickva

Thanks for your feedback; I think we are doing something wrong on our end. It's a Mango index/view-heavy setup (200+ of them per database, with a lot of DBs) to support the logic we need.

Currently, we are re-creating the databases that had too many documents (10M+; the majority was log data we don't need, because the other side that replicates to this server was running in debug mode). I noticed that some databases show the warning below in Fauxton, and I wonder whether this can be an issue and how to "clean" them (if needed).

[Image: warning shown for the database in Fauxton]

  • We group multiple views into design documents (based on whether they belong together from a domain-knowledge perspective). Is this a good idea?
  • We use partitioned databases. Does it make sense to short-circuit views by checking the incoming document ID so they only "run" for the "correct" partition?
  • Any other advice for reducing system load, or general learnings / things we might have missed in the documentation?

We still seem to have issues. I could not look into the log, as it was ~50 GB and none of my editors wanted to handle that. The log level is already set to warning only.

I got the crash dump and uploaded it to the gist below; unfortunately, I can't read much out of it, to be honest.

https://gist.github.com/idea-christian/905c0c0a2db179aa57fc6b95f0b55526

Update

I adjusted the systemd unit as described here (I guess that's what the log is trying to tell me): https://docs.couchdb.org/en/stable/maintenance/performance.html#maximum-open-file-descriptors-ulimit Meanwhile, I was able to grab some more log (which is already 800 MB again); this is the top of the log file:

[error] 2025-05-21T09:07:33.271893Z [email protected] <0.12177941.0> -------- Could not open file ./data/.shards/80000000-ffffffff/vessel_1285.1731666409_design/mrview/f6834bc01f3a0066de7e6af2234424ab.view: too many open files
[error] 2025-05-21T09:07:33.271936Z [email protected] <0.12177941.0> -------- Failed to open view file './data/.shards/80000000-ffffffff/vessel_1285.1731666409_design/mrview/f6834bc01f3a0066de7e6af2234424ab.view': too many open files
[error] 2025-05-21T09:07:33.273003Z [email protected] emulator -------- Error in process <0.12171346.0> on node '[email protected]' with exit value:
{{badmatch,{error,emfile}},[{ken_server,update_ddoc_views,4,[{file,"src/ken_server.erl"},{line,404}]},{ken_server,update_ddoc_indexes,3,[{file,"src/ken_server.erl"},{line,318}]},{ken_server,'-update_db_indexes/2-fun-1-',4,[{file,"src/>

[error] 2025-05-21T09:07:33.273949Z [email protected] emulator -------- Error in process <0.12171346.0> on node '[email protected]' with exit value:
{{badmatch,{error,emfile}},[{ken_server,update_ddoc_views,4,[{file,"src/ken_server.erl"},{line,404}]},{ken_server,update_ddoc_indexes,3,[{file,"src/ken_server.erl"},{line,318}]},{ken_server,'-update_db_indexes/2-fun-1-',4,[{file,"src/>

[error] 2025-05-21T09:07:34.081446Z [email protected] <0.12176411.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.081519Z [email protected] <0.12176411.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.087933Z [email protected] <0.12176411.0> -------- CRASH REPORT Process  (<0.12176411.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.088175Z [email protected] <0.12176411.0> -------- CRASH REPORT Process  (<0.12176411.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.088213Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:34.088227Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:34.304010Z [email protected] <0.12176949.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.304036Z [email protected] <0.12176949.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.304184Z [email protected] <0.12176949.0> -------- CRASH REPORT Process  (<0.12176949.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.304348Z [email protected] <0.12176949.0> -------- CRASH REPORT Process  (<0.12176949.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.304412Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:34.304424Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:34.763327Z [email protected] <0.12177592.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.763390Z [email protected] <0.12177592.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:34.763693Z [email protected] <0.12177592.0> -------- CRASH REPORT Process  (<0.12177592.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.764019Z [email protected] <0.12177592.0> -------- CRASH REPORT Process  (<0.12177592.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:34.764211Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:34.764384Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.118021Z [email protected] <0.12178572.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.118045Z [email protected] <0.12178572.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.118451Z [email protected] <0.12178572.0> -------- CRASH REPORT Process  (<0.12178572.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.118545Z [email protected] <0.12178572.0> -------- CRASH REPORT Process  (<0.12178572.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.118568Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.118584Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.243412Z [email protected] <0.12177057.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.243442Z [email protected] <0.12177057.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.243608Z [email protected] <0.12177057.0> -------- CRASH REPORT Process  (<0.12177057.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.243845Z [email protected] <0.12177057.0> -------- CRASH REPORT Process  (<0.12177057.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.243868Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.244741Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.262524Z [email protected] <0.12176933.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.262557Z [email protected] <0.12176933.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.262716Z [email protected] <0.12176933.0> -------- CRASH REPORT Process  (<0.12176933.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.262804Z [email protected] <0.12176933.0> -------- CRASH REPORT Process  (<0.12176933.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.349986Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.350011Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.433114Z [email protected] <0.12176255.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.433152Z [email protected] <0.12176255.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.433305Z [email protected] <0.12176255.0> -------- CRASH REPORT Process  (<0.12176255.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.433502Z [email protected] <0.12176255.0> -------- CRASH REPORT Process  (<0.12176255.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.450338Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.450370Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.536413Z [email protected] <0.12179066.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.536441Z [email protected] <0.12179066.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.537535Z [email protected] <0.12179066.0> -------- CRASH REPORT Process  (<0.12179066.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.537935Z [email protected] <0.12179066.0> -------- CRASH REPORT Process  (<0.12179066.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.550946Z [email protected] <0.12176497.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.550978Z [email protected] <0.12176497.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"
[error] 2025-05-21T09:07:35.551121Z [email protected] <0.12176497.0> -------- CRASH REPORT Process  (<0.12176497.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.551204Z [email protected] <0.12176497.0> -------- CRASH REPORT Process  (<0.12176497.0>) with 0 neighbors exited with reason: {error,accept_failed} at mochiweb_acceptor:init/4(line:71) <= proc_lib:init_p_do>
[error] 2025-05-21T09:07:35.551224Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.551565Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.654135Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.654160Z [email protected] <0.399.0> -------- {mochiweb_socket_server,383,{acceptor_error,{error,accept_failed}}}
[error] 2025-05-21T09:07:35.712625Z [email protected] <0.12179251.0> -------- application: mochiweb, "Accept failed error", "{error,emfile}"

idea-christian avatar May 21 '25 08:05 idea-christian

"too many open files" points to ulimit -n (or the systemd equivalent) not being set high enough.
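
For reference, a minimal sketch of raising it for a systemd-managed CouchDB; the unit name couchdb and the value 65536 are assumptions, see the performance docs linked above for the authoritative steps:

# Assumes the service is called couchdb.service; adjust if yours differs.
# "systemctl edit" opens a drop-in override; add these two lines, save, restart:
#
#   [Service]
#   LimitNOFILE=65536
#
sudo systemctl edit couchdb
sudo systemctl restart couchdb

# Verify what the running Erlang VM actually got:
sudo grep 'open files' /proc/$(pgrep -f beam.smp | head -n1)/limits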

janl avatar May 21 '25 11:05 janl

> We group multiple views into design documents (based on whether they belong together from a domain-knowledge perspective). Is this a good idea?

This is a good idea for your setup.

> We use partitioned databases. Does it make sense to short-circuit views by checking the incoming document ID so they only "run" for the "correct" partition?

What is your partition key? View updates will be per-shard, not per-partition.

> Any other advice for reducing system load, or general learnings / things we might have missed in the documentation?

For this we need more info about your hardware, data sizes and shapes, and traffic patterns.

> I noticed that some databases show the warning below in Fauxton, and I wonder whether this can be an issue and how to "clean" them (if needed).

The easiest fix today is replicating the DB into a new DB with a replication filter (a Mango selector is preferred) that drops deleted docs. There is some tooling that can help with that (-f option) if you have a lot of DBs to go through. Alternatively, you can purge all those docs in a script, but there are some subtleties to get right around index updates and purge batch sizes; best to avoid that if you can.

If you migrate to a cluster as Nick outlined, you can drop the deleted docs there.

Finally, if you can live with this for a little while longer, we are aiming to ship a feature that lets CouchDB automatically prune those deleted docs for you. No promises on availability, but we are actively working on it.

The main impact this will have for you is on new indexes being added on those databases and on replication peers starting from scratch. If you don't have either of those, just wait for a future release.
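
As a rough sketch of the filtered-copy approach (placeholder names and credentials, and please verify the selector behaviour against the replication docs for your version before relying on it):

# Placeholder db names/credentials. The selector is *assumed* to skip
# tombstones (deleted docs carry _deleted: true in their bodies); verify this
# before using it for real. A classic alternative is a design-doc filter
# function that returns !doc._deleted.
curl -s -X POST http://admin:password@127.0.0.1:5984/_replicate \
  -H 'Content-Type: application/json' \
  -d '{
        "source": "vessel_0945",
        "target": "vessel_0945_clean",
        "create_target": true,
        "selector": {"_deleted": {"$exists": false}}
      }'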

janl avatar May 21 '25 12:05 janl

Thanks for coming back. I adjusted the systemd unit as indicated in the documentation; no idea why I missed this initially...

> What is your partition key? View updates will be per-shard, not per-partition.

The partition keys are for topics/types in the data. For example, there is one partition key that is used for all task items in the system and another one for the history created by the tasks. All partition keys follow this pattern:

ClientNumber-SystemPart:UUID

So for example: 0000-Task:1e79d26a-bac4-49bc-9816-07b240a5abfe

I think it would be fair to say that we currently use the partition keys a bit like tables in SQL. Some views / their related map functions have code like this:

function (doc) {
 if(doc._id.indexOf('-certificate_detail:') == -1) return;
 emit(doc.FleetMasterDetailID, 1);
}

The idea was to kick out unneeded documents directly. Does this help at all? I guess our problem is more the Mango indexes; we have a lot of them to support multiple filters (my guess is around 700).
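
For context, the main benefit of partitioned databases shows up at query time, when a request is scoped to a single partition via the _partition path (indexing, as noted above, still happens per shard). A sketch with hypothetical db/ddoc/view names:

# Hypothetical names: db vessel_0945, partition 0000-Task, design doc tasks,
# view by-detail. Only rows from that one partition are returned.
curl -s "http://admin:password@127.0.0.1:5984/vessel_0945/_partition/0000-Task/_design/tasks/_view/by-detail?limit=10"

# Mango queries can be scoped to a partition the same way:
curl -s -X POST "http://admin:password@127.0.0.1:5984/vessel_0945/_partition/0000-Task/_find" \
  -H 'Content-Type: application/json' \
  -d '{"selector": {"FleetMasterDetailID": {"$gt": null}}, "limit": 10}'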

> For this we need more info about your hardware, data sizes and shapes, and traffic patterns.

Currently, it's a single-node VM running Debian with 12 cores and 62 GB of RAM. The databases range from a few MB to around 5 GB, where the bigger ones' size is due to thumbnails attached to documents. Each database is in replication with another server running on the client side. Besides that, CouchDB is used through our API to serve the web frontend; this is mostly views and Mango queries. We also use full-text search with Clouseau.

Regarding the deleted documents, thanks for the details. I guess in that case we'll wait, as we don't plan to add new indexes right now, and adding further DBs to the system (given the load issues) doesn't sound like a good idea.

I'm just surprised that we are running into these issues now, as this is not a new server, and even after removing a lot of the unneeded clutter it still seems to have issues. Again, any help is appreciated, as I'm a bit clueless about what to do next.

Update

We talked about this here again and "maybe" found a reason, but we are unsure given the current documentation. Until now we have used the /{db}/_index endpoint to periodically (every 5 min) ask each database to create each index (around 700, again), because we assumed it does nothing when the index already exists, given that the documentation says: "200 Index created successfully or already exists". Can this be related?
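
For reference, a sketch of what that periodic call does and how to see the server's answer; the db and index names are hypothetical:

# Hypothetical db/index names. POSTing the same definition again should return
# "result":"exists" instead of "created" and leave the design doc untouched.
curl -s -X POST http://admin:password@127.0.0.1:5984/vessel_0945/_index \
  -H 'Content-Type: application/json' \
  -d '{"index": {"fields": ["FleetMasterDetailID"]}, "name": "by-fleet-master-detail", "type": "json"}'

# Listing the existing indexes is a cheaper way to decide whether a POST is needed at all:
curl -s http://admin:password@127.0.0.1:5984/vessel_0945/_index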

Update 2

While the server is overall still stable after the change mentioned in the update above, we now see the log entries below, which are new to me. If somebody could help with this, and also with the question above regarding the _index endpoint, that would be great.

[error] 2025-05-22T14:25:43.499955Z [email protected] emulator -------- Error in process <0.142500288.0> on node '[email protected]' with exit value:
{{badmatch,{'EXIT',noproc}},[{dreyfus_index_updater,count_pending_purged_docs_since,2,[{file,"src/dreyfus_index_updater.erl"},{line,128}]},{dreyfus_index_updater,update,2,[{file,"src/dreyfus_index_updater.erl"},{line,34}]}]}

[error] 2025-05-22T14:25:43.500257Z [email protected] emulator -------- Error in process <0.142500288.0> on node '[email protected]' with exit value:
{{badmatch,{'EXIT',noproc}},[{dreyfus_index_updater,count_pending_purged_docs_since,2,[{file,"src/dreyfus_index_updater.erl"},{line,128}]},{dreyfus_index_updater,update,2,[{file,"src/dreyfus_index_updater.erl"},{line,34}]}]}

idea-christian avatar May 21 '25 12:05 idea-christian

A quick update from a Slack discussion:

Use of partitions with these data sizes and on a single node will not yield any benefits. If this design is meant to scale 100x and onto a cluster, I would suggest reducing the partition keys to just the user ID and excluding the types (as the docs state).

The other issues with process and file limits are resolved.

There has been one more crash with lots of logs that could not be retained in the moment, so we are waiting for it to happen again to see what caused it.

And for research, we want to make sure that POST /_index is really, really idempotent and doesn't cause any side effects when called 700 times every five minutes (it shouldn't, but I don't know OTOH).

janl avatar May 23 '25 09:05 janl

Disclaimer: I'm not good at reading Erlang, and this is the first time I have looked into the CouchDB code. I ran my checks with the help of Copilot to identify the relevant parts. All my checks were made on the current main branch; I did not check other branches. I was unable to compile and debug the code.

In short:

I think it's safe to call the {db}/_index endpoint via POST with the same index again; nothing will happen in the database, because of how Erlang compares objects. The duplication check is more implicit than explicit in the code.

The details:

  1. The request is handled by mango_httpd:handle_index_req
  2. mango_httpd:handle_index_req tries to load the current ddoc from the database
  3. mango_idx:add is called that then calls (for JSON index) mango_idx_view:add
  4. Due to how the data structure is created, mango_idx:add returns the same object (for an existing index) as the one already loaded by mango_httpd:handle_index_req
  5. The code in mango_httpd:handle_index_req returns with an "exists" result, and mango_crud:insert is not called

Dev-Owl avatar May 25 '25 07:05 Dev-Owl

This is likely just fd-limit exhaustion. Please re-open if more evidence of other issues arises.

janl avatar Jul 04 '25 14:07 janl