dragonfly icon indicating copy to clipboard operation
dragonfly copied to clipboard

Failed to pull DflyConnection from remote_ready_queue

Open dg-korolev opened this issue 1 year ago • 7 comments

Describe the bug Tonight we got a crash problem on dragonfly instance. This issue may be related to previously reported issue: https://github.com/dragonflydb/dragonfly/issues/2599

dragonfly-1  | I20240222 00:44:07.197453    13 save_stages_controller.cc:283] Saving "dump-2024-02-22T00:44:00-summary.dfs" finished after 7 s
dragonfly-1  | W20240222 00:44:08.000803     8 transaction.cc:58] TxQueue is too long. Tx count:174, armed:93, runnable:0, total locks: 19, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1089489810 continuation_tx: BRPOPLPUSH@967305287/2 {id=6892}  armed
dragonfly-1  | W20240222 00:44:09.000679    16 transaction.cc:58] TxQueue is too long. Tx count:166, armed:75, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1023973961
dragonfly-1  | W20240222 00:44:09.000916    15 transaction.cc:58] TxQueue is too long. Tx count:176, armed:86, runnable:0, total locks: 15, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:752259126
dragonfly-1  | W20240222 00:44:09.000931    14 transaction.cc:58] TxQueue is too long. Tx count:172, armed:78, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:922722963
dragonfly-1  | W20240222 00:44:09.000897    13 transaction.cc:58] TxQueue is too long. Tx count:160, armed:66, runnable:0, total locks: 21, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:971144512
dragonfly-1  | W20240222 00:44:09.000893    10 transaction.cc:58] TxQueue is too long. Tx count:126, armed:35, runnable:0, total locks: 24, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:912110770 continuation_tx: BRPOPLPUSH@967310621/2 {id=14743}  armed
dragonfly-1  | W20240222 00:44:09.000903     9 transaction.cc:58] TxQueue is too long. Tx count:144, armed:54, runnable:0, total locks: 17, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:801155636
dragonfly-1  | W20240222 00:44:09.000910    12 transaction.cc:58] TxQueue is too long. Tx count:173, armed:83, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:901691102 continuation_tx: BRPOPLPUSH@967310620/2 {id=14596}  armed
dragonfly-1  | W20240222 00:44:09.000947    17 transaction.cc:58] TxQueue is too long. Tx count:206, armed:114, runnable:0, total locks: 16, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1018598879 continuation_tx: BRPOPLPUSH@967310620/2 {id=14596}  armed
dragonfly-1  | W20240222 00:44:11.000659    11 transaction.cc:58] TxQueue is too long. Tx count:171, armed:73, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:837878273 continuation_tx: BRPOPLPUSH@967319900/2 {id=26317}
dragonfly-1  | W20240222 00:44:18.001056     8 transaction.cc:58] TxQueue is too long. Tx count:180, armed:89, runnable:0, total locks: 19, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1089548305 continuation_tx: ZREM@967352863/1:9 {id=28558}
dragonfly-1  | W20240222 00:44:19.000793    14 transaction.cc:58] TxQueue is too long. Tx count:131, armed:46, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:922770119 continuation_tx: RPOPLPUSH@967358211/2:7 {id=65019}
dragonfly-1  | W20240222 00:44:19.000793     9 transaction.cc:58] TxQueue is too long. Tx count:144, armed:54, runnable:0, total locks: 17, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:801203048
dragonfly-1  | W20240222 00:44:19.000875    10 transaction.cc:58] TxQueue is too long. Tx count:114, armed:24, runnable:0, total locks: 24, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:912155384 continuation_tx: ZRANGEBYSCORE@967358212/1:1 {id=1675}
dragonfly-1  | W20240222 00:44:19.000793    13 transaction.cc:58] TxQueue is too long. Tx count:125, armed:33, runnable:0, total locks: 21, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:971192000
dragonfly-1  | W20240222 00:44:19.000814    17 transaction.cc:58] TxQueue is too long. Tx count:176, armed:86, runnable:0, total locks: 16, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1018648615
dragonfly-1  | W20240222 00:44:19.000815    16 transaction.cc:58] TxQueue is too long. Tx count:140, armed:46, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1024025187
dragonfly-1  | W20240222 00:44:19.000815    12 transaction.cc:58] TxQueue is too long. Tx count:147, armed:57, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:901742488
dragonfly-1  | W20240222 00:44:19.000799    15 transaction.cc:58] TxQueue is too long. Tx count:140, armed:45, runnable:0, total locks: 15, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:752297090
dragonfly-1  | W20240222 00:44:21.002529    11 transaction.cc:58] TxQueue is too long. Tx count:97, armed:28, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:837920993
dragonfly-1  | W20240222 00:44:28.000692     8 transaction.cc:58] TxQueue is too long. Tx count:168, armed:78, runnable:0, total locks: 19, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1089607535
dragonfly-1  | W20240222 00:44:29.000649    17 transaction.cc:58] TxQueue is too long. Tx count:174, armed:78, runnable:0, total locks: 16, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1018702120
dragonfly-1  | W20240222 00:44:29.000862     9 transaction.cc:58] TxQueue is too long. Tx count:133, armed:50, runnable:0, total locks: 17, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:801249272
dragonfly-1  | W20240222 00:44:29.000885    16 transaction.cc:58] TxQueue is too long. Tx count:131, armed:37, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:1024080133
dragonfly-1  | W20240222 00:44:29.001067    14 transaction.cc:58] TxQueue is too long. Tx count:133, armed:68, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:922825635
dragonfly-1  | W20240222 00:44:29.001266    12 transaction.cc:58] TxQueue is too long. Tx count:151, armed:62, runnable:0, total locks: 18, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:901794357 continuation_tx: RPOPLPUSH@967410848/2:7 {id=42178}
dragonfly-1  | W20240222 00:44:29.001277    13 transaction.cc:58] TxQueue is too long. Tx count:134, armed:42, runnable:0, total locks: 21, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:971243770
dragonfly-1  | W20240222 00:44:29.001331    15 transaction.cc:58] TxQueue is too long. Tx count:145, armed:47, runnable:0, total locks: 15, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:752339155
dragonfly-1  | W20240222 00:44:29.001358    10 transaction.cc:58] TxQueue is too long. Tx count:125, armed:27, runnable:0, total locks: 24, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:912208224
dragonfly-1  | W20240222 00:44:31.000830    11 transaction.cc:58] TxQueue is too long. Tx count:118, armed:23, runnable:0, total locks: 14, contended locks: 0
dragonfly-1  | max contention score: 0, lock: , poll_executions:837968323
dragonfly-1  | F20240222 00:44:34.094962    11 fiber_interface.cc:303] Failed to pull DflyConnection from remote_ready_queue. res=0 remotenext: 140703315138560 remotempty: 0
dragonfly-1  | * Check failure stack trace: *
dragonfly-1  |     @     0x5578a0569a03  google::logmessage::SendToLog()
dragonfly-1  |     @     0x5578a05621c7  google::logmessage::Flush()
dragonfly-1  |     @     0x5578a0563b4f  google::logmessagefatal::~LogMessageFatal()
dragonfly-1  |     @     0x5578a0337ed5  util::fb2::detail::fiberinterface::PullMyselfFromRemoteReadyQueue()
dragonfly-1  |     @     0x5578a03301d4  util::fb2::eventcount::wait_until()
dragonfly-1  |     @     0x55789fe78a8b  dfly::transaction::batonbarrier::Wait()
dragonfly-1  |     @     0x55789fe7c4c0  dfly::transaction::WaitOnWatch()
dragonfly-1  |     @     0x55789fe17e14  dfly::(anonymous namespace)::bpoppusher::Run()
dragonfly-1  |     @     0x55789fe18256  dfly::(anonymous namespace)::BRPopLPush()
dragonfly-1  |     @     0x55789fe6486f  dfly::commandid::Invoke()
dragonfly-1  |     @     0x55789fc7020c  dfly::service::InvokeCmd()
dragonfly-1  |     @     0x55789fc74d71  dfly::service::DispatchCommand()
dragonfly-1  |     @     0x55789ff53f3d  facade::connection::DispatchCommand()
dragonfly-1  |     @     0x55789ff54164  facade::connection::ParseRedis()
dragonfly-1  |     @     0x55789ff5658f  facade::connection::IoLoop()
dragonfly-1  |     @     0x55789ff56a83  facade::connection::ConnectionFlow()
dragonfly-1  |     @     0x55789ff57a5f  facade::connection::HandleRequests()
dragonfly-1  |     @     0x5578a034351d  util::listenerinterface::RunSingleConnection()
dragonfly-1  |     @     0x5578a0343995  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberENS0_21basic_fixedsize_stackINS0_12stack_traitsEEEZN4util3fb26detail15WorkerFiberImplIZNS8_17ListenerInterface13RunAcceptLoopEvEUlvE0_JEEC4IS7_EESt17basic_string_viewIcSt11char_traitsIcEERKNS0_12preallocatedEOT_OSD_EUlOS4_E_EEEEvNS1_10transfer_tE
dragonfly-1  |     @     0x5578a038686f  make_fcontext
dragonfly-1  | * SIGABRT received at time=1708562674 on cpu 3 *
dragonfly-1  | PC: @     0x7ff87fb6800b  (unknown)  raise
dragonfly-1  | [failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7ff87fb47941 while already in AbslFailureSignalHandler()
dragonfly-1  | * SIGSEGV received at time=1708562674 on cpu 3 *
dragonfly-1  | PC: @     0x7ff87fb47941  (unknown)  abort
dragonfly-1  | I20240222 00:45:11.232707     1 init.cc:70] dragonfly running in opt mode.
dragonfly-1  | I20240222 00:45:11.232919     1 dfly_main.cc:627] Starting dragonfly df-1e06c63-1e06c63727ffe84d2b4f9c36c6d5122082f52ae0
dragonfly-1  | I20240222 00:45:11.233206     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
dragonfly-1  | I20240222 00:45:11.233234     1 dfly_main.cc:680] Found 82.87GiB available memory. Setting maxmemory to 66.30GiB
dragonfly-1  | W20240222 00:45:11.233429     1 dfly_main.cc:354] Weird error 1 switching to epoll
dragonfly-1  | * Logs will be written to the first available of the following paths:
dragonfly-1  | /tmp/dragonfly.*
dragonfly-1  | ./dragonfly.*
dragonfly-1  | * For the available flags type dragonfly [--help | --helpfull]
dragonfly-1  | * Documentation can be found at: https://www.dragonflydb.io/docs
dragonfly-1  | I20240222 00:45:11.311951     1 proactor_pool.cc:146] Running 10 io threads
dragonfly-1  | I20240222 00:45:11.325162     1 server_family.cc:659] Host OS: Linux 5.10.0-28-amd64 x86_64 with 10 threads
dragonfly-1  | I20240222 00:45:11.327026     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/data"
dragonfly-1  | I20240222 00:45:11.330243     1 server_family.cc:822] Loading /data/dump-2024-02-22T00:44:00-summary.dfs
dragonfly-1  | I20240222 00:45:11.331840     9 listener_interface.cc:101] sock[23] AcceptServer - listening on port 6379
dragonfly-1  | I20240222 00:45:21.040669    12 server_family.cc:876] Load finished, num keys read: 4802585

Environment (please complete the following information):

  • OS: Debian 11 64bit
  • Kernel: 5.10.0-28-amd64 # SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux
  • Containerized:
    • Docker version 25.0.3, build 4debf41
    • Docker Compose version v2.24.5
    • docker-compose.yml:
version: '3.8'
services:
  dragonfly:
    restart: always
    image: 'ghcr.io/dragonflydb/dragonfly-weekly:1e06c63727ffe84d2b4f9c36c6d5122082f52ae0-ubuntu'
    ulimits:
      memlock: -1
    ports:
      - "6379:6379"
    command: [
      #'--default_lua_flags=allow-undeclared-keys disable-atomicity',
      '--default_lua_flags=allow-undeclared-keys',
      '--snapshot_cron=*/1 * * * *',
      '--proactor_threads=10',
      '--lua_auto_async=true',
      '--cluster_mode=emulated',
      '--lock_on_hashtags=true'
    ]
    volumes:
      - dragonflydata:/data
    logging:
      driver: 'json-file'
      options:
          max-file: '2'
          max-size: 150m
volumes:
  dragonflydata
  • Dragonfly Version: dragonfly-weekly:1e06c63727ffe84d2b4f9c36c6d5122082f52ae0-ubuntu

Additional context

Server

redis_version:6.2.11 dragonfly_version:df-1e06c63 redis_mode:cluster arch_bits:64 os:Linux 5.10.0-28-amd64 x86_64 multiplexing_api:epoll tcp_port:6379 thread_count:10 uptime_in_seconds:22952 uptime_in_days:0

Clients

connected_clients:17126 client_read_buffer_bytes:64649472 blocked_clients:2379 dispatch_queue_entries:715

Memory

used_memory:5069151624 used_memory_human:4.72GiB used_memory_peak:6047369968 used_memory_peak_human:5.63GiB used_memory_rss:8625655808 used_memory_rss_human:8.03GiB used_memory_peak_rss:9128116224 comitted_memory:8665546768 maxmemory:71206184550 maxmemory_human:66.32GiB object_used_memory:4652220536 type_used_memory_STRING:130084768 type_used_memory_LIST:112017112 type_used_memory_SET:128 type_used_memory_ZSET:8919504 type_used_memory_HASH:4401199024 table_used_memory:461137616 num_buckets:856140 num_entries:4069884 inline_keys:1 listpack_blobs:18446744073709551541 listpack_bytes:18446744073709551245 small_string_bytes:177402880 pipeline_cache_bytes:585527830 dispatch_queue_bytes:155854306 dispatch_queue_subscriber_bytes:0 dispatch_queue_peak_bytes:994844136 client_read_buffer_peak_bytes:71593216 cache_mode:store maxmemory_policy:noeviction

Stats

total_connections_received:49584 total_commands_processed:399834239 instantaneous_ops_per_sec:20779 total_pipelined_commands:19666025 pipelined_latency_usec:23117808172290 total_net_input_bytes:162161664744 connection_migrations:0 total_net_output_bytes:44692164412 instantaneous_input_kbps:-1 instantaneous_output_kbps:-1 rejected_connections:-1 expired_keys:26588 evicted_keys:0 hard_evictions:0 garbage_checked:265 garbage_collected:1 bump_ups:0 stash_unloaded:19590 oom_rejections:0 traverse_ttl_sec:0 delete_ttl_sec:0 keyspace_hits:46561511 keyspace_misses:92144233 keyspace_mutations:265666683 total_reads_processed:107250651 total_writes_processed:211931196 defrag_attempt_total:0 defrag_realloc_total:0 defrag_task_invocation_total:0 reply_count:211931196 reply_latency_usec:350785827 blocked_on_interpreter:751 ram_hits:0 ram_misses:0

Replication

role:master connected_slaves:0 master_replid:246b1cc8d00249bffee4a9a798fa9b71fc5fd938

Modules

module:name=ReJSON,ver=20000,api=1,filters=0,usedby=[search],using=[],options=[handle-io-errors] module:name=search,ver=20000,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]

Keyspace

db0:keys=4069884,expires=2549,avg_ttl=-1

Cpu

used_cpu_sys:37006.527869 used_cpu_user:40330.558211 used_cpu_sys_children:0.0 used_cpu_user_children:0.4102 used_cpu_sys_main_thread:3116.513383 used_cpu_user_main_thread:3675.871304

Cluster

cluster_enabled:1

dg-korolev avatar Feb 22 '24 09:02 dg-korolev

@dg-korolev that's disappointing. I am looking into it.

romange avatar Feb 22 '24 11:02 romange

@dg-korolev do you run in cloud or on-prem?

romange avatar Feb 22 '24 11:02 romange

@romange It's running on-premise, on one's own hardware

dg-korolev avatar Feb 22 '24 11:02 dg-korolev

I will build another custom binary. it won't solve the issue but will provide more insights if this happens again.

romange avatar Feb 22 '24 11:02 romange

@dg-korolev please use this image docker pull ghcr.io/dragonflydb/dragonfly-weekly:fbb0a9e2f138913e98cf117cbbbbcd1aeb9b9afb-ubuntu

romange avatar Feb 22 '24 15:02 romange

Thanks, the image is running. I continue to monitor this 👍

dg-korolev avatar Feb 22 '24 16:02 dg-korolev

from v1.15.0:

 E20240311 10:35:58.513952  1805 protocol_client.cc:297] Could not shutdown socket system:107
 E20240311 10:35:58.531046  1805 replica.cc:191] Error connecting to 192.168.112.60:9999 generic:125           
 E20240311 10:41:33.562984  1806 scheduler.cc:323] Failed to pull active fiber from remote_ready_queue, iteration 0 remote_empty: 0, next:281472604971776                                
 E20240311 10:41:33.563086  1806 scheduler.cc:327] Next fiber next is: 0xffff70060b00, usecount2               
 F20240311 10:41:33.563102  1806 fiber_interface.cc:303] Failed to pull DflyConnection from remote_ready_queue. res=0 remotenext: 281472604971776 remotempty: 0                          
 *** Check failure stack trace: ***  
     @     0xaaaad7218144  google::LogMessage::SendToLog()                                                     
     @     0xaaaad7211228  google::LogMessage::Flush()                                                         
     @     0xaaaad7212b64  google::LogMessageFatal::~LogMessageFatal()                                         
     @     0xaaaad704922c  util::fb2::detail::FiberInterface::PullMyselfFromRemoteReadyQueue()

romange avatar Mar 11 '24 10:03 romange

Critical issues were fixed in 1.16.0

romange avatar Apr 09 '24 11:04 romange