
test_redis_replication_all fails

Open · kostasrim opened this issue

See https://github.com/dragonflydb/dragonfly/actions/runs/12764368205/job/35576390254#step:6:822

            await c_master.close()
            await c_replica.close()
>           assert timeout > 0, "Timeout while waiting for replica to sync"
E           AssertionError: Timeout while waiting for replica to sync
E           assert 0 > 0

kostasrim avatar Jan 14 '25 09:01 kostasrim

https://github.com/dragonflydb/dragonfly/actions/runs/13022114433/job/36324777965#step:6:1177

kostasrim avatar Jan 30 '25 08:01 kostasrim

This hasn't failed for some time. If it does fail I will reopen.

kostasrim avatar Feb 20 '25 11:02 kostasrim

This test fails again. https://github.com/dragonflydb/dragonfly/actions/runs/14528984004/job/40765540786

Errors:

Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20250418 03:40:38.767102 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:39.267544 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:39.767937 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:40.268354 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:40.768642 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:41.269273 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:41.769763 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
E20250418 03:40:41.810784 22127 protocol_client.cc:230] Could not shutdown socket system:107

Warnings:

W20250418 03:39:53.190562 22125 dfly_main.cc:786] SWAP is enabled. Consider disabling it when running Dragonfly.
W20250418 03:39:53.450364 22127 rdb_load.cc:2479] Unrecognized RDB AUX field: 'valkey-ver'
W20250418 03:39:53.450423 22127 rdb_load.cc:2479] Unrecognized RDB AUX field: 'aof-base'
W20250418 03:40:37.153556 22127 main_service.cc:1363]  REPLCONF GETACK * failed with reason: Replicating a replica is unsupported
W20250418 03:40:38.266096 22127 protocol_client.cc:59] Socket error system:103
W20250418 03:40:38.266132 22127 common.cc:413] ReportError: Software caused connection abort
W20250418 03:40:38.266285 22127 replica.cc:259] Error stable sync with localhost:5586 system:103 Software caused connection abort
E20250418 03:40:38.767102 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:38.767315 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:39.267544 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:39.267583 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:39.767937 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:39.767978 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:40.268354 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:40.268404 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:40.768642 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:40.768683 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:41.269273 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:41.269313 22127 replica.cc:211] Error connecting to localhost:5586 system:111
E20250418 03:40:41.769763 22127 protocol_client.cc:193] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250418 03:40:41.769821 22127 replica.cc:211] Error connecting to localhost:5586 system:111
W20250418 03:40:41.810752 22127 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
E20250418 03:40:41.810784 22127 protocol_client.cc:230] Could not shutdown socket system:107
W20250418 03:40:42.270139 22127 replica.cc:211] Error connecting to localhost:5586 generic:125

vyavdoshenko avatar Apr 18 '25 05:04 vyavdoshenko

one more time https://github.com/dragonflydb/dragonfly/actions/runs/14913001208

BorysTheDev avatar May 09 '25 06:05 BorysTheDev

one more time https://github.com/dragonflydb/dragonfly/actions/runs/15041074295/job/42272722352

BagritsevichStepan avatar May 16 '25 05:05 BagritsevichStepan

https://github.com/dragonflydb/dragonfly/actions/runs/17064796321/job/48379371916 (for epoll)

vyavdoshenko avatar Aug 19 '25 10:08 vyavdoshenko

https://github.com/dragonflydb/dragonfly/actions/runs/17937145378/job/51005345978 epoll failure

abhijat avatar Sep 24 '25 05:09 abhijat

https://github.com/dragonflydb/dragonfly/actions/runs/17937145378/job/51005345978 epoll failure

In this instance, the error occurs during await replicate_all(c_replicas, master.port). The function wait_available_async is called on the replica client, and it times out waiting for the replica to answer a PING successfully.

In the replica logs, around five attempts can be seen where a PING is rejected:

I20250923 06:21:54.624557 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
..
I20250923 06:21:54.634847 22617 dragonfly_connection.cc:856] Closed connection for peer id=1 addr=127.0.0.1:46252 laddr=127.0.0.1:5587 fd=8 name=1 tid=0 irqmatch=0 age=0 idle=0 tot-cmds=4 tot-net-in=168 tot-read-calls=4 tot-dispatches=0 db=0 flags=t phase=preclose lib-name=redis-py lib-ver=5.2.1
...
I20250923 06:21:54.649621 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
...
I20250923 06:21:54.668099 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
...
I20250923 06:21:54.689772 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
I20250923 06:21:54.691730 22617 replica.cc:1129] header: FULLRESYNC 92009043ccad762c7476bdb213de562e47e5f906 54386
I20250923 06:21:54.691751 22617 replica.cc:1138] master repl_id 92009043ccad762c7476bdb213de562e47e5f906 / 54386
...
I20250923 06:21:54.718245 22617 replica.cc:1157] token: EOF:6b2f7cb08dbed6b7ab7e6cbac94d8ec3fd0498d6
I20250923 06:21:54.718289 22617 replica.cc:1161] Token: 6b2f7cb08dbed6b7ab7e6cbac94d8ec3fd0498d6
I20250923 06:21:54.718302 22617 replica.cc:438] Starting full sync with Redis master
I20250923 06:21:54.718364 22617 main_service.cc:1419] Executing dangerous command FLUSHALL (null-conn)
I20250923 06:21:54.718480 22617 dragonfly_connection.cc:1888] [5] LaunchAsyncFiberIfNeeded
I20250923 06:21:54.718544 22617 db_slice.cc:932] Flushing db 65535

Note that around the last attempt a full sync begins.

I20250923 06:21:54.842031 22617 replica.cc:474] Comparing token 6b2f7cb08dbed6b7ab7e6cbac94d8ec3fd0498d6
I20250923 06:21:54.842080 22617 main_service.cc:2808] Switching state from LOADING to ACTIVE
I20250923 06:21:54.842141 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
I20250923 06:21:54.842240 22617 engine_shard.cc:372] This heartbeat step took 124ms
I20250923 06:21:54.894316 22617 replica.cc:668] Before reading repl-log
I20250923 06:21:54.894361 22617 replica.cc:675] Transitioned into stable sync
W20250923 06:24:17.842257 22617 protocol_client.cc:60] Socket error generic:103

The replica transitions into stable sync just as the last PING attempt is rejected, but there are no further PING attempts (which should have succeeded) and the test eventually times out. Note that the gap between the last two log lines is nearly two minutes, which is the timeout.

~~Also interesting is that after the last PING attempt is rejected, the connection from the redis client does not appear to close as it did in the earlier attempts, i.e. we do not see a log line like "Closed connection for peer id".~~ See the next comment: it is closed, but much later.

abhijat avatar Oct 13 '25 05:10 abhijat

The connection id of the connection that got stuck on the last PING seems to be 5. Searching the logs for it:

I20250923 06:21:54.700819 22617 dragonfly_connection.cc:750] [5] HandleRequests
I20250923 06:21:54.718480 22617 dragonfly_connection.cc:1888] [5] LaunchAsyncFiberIfNeeded
I20250923 06:21:54.842141 22617 main_service.cc:1292] Command PING not executed because global state is LOADING
I20250923 06:24:17.843137 22617 dragonfly_connection.cc:1365] [5] Got event 8197 1 0 system:0
I20250923 06:24:17.843158 22617 dragonfly_connection.cc:2070] [5] Connection::breaker_cb_ 8197
I20250923 06:24:17.845412 22617 dragonfly_connection.cc:856] Closed connection for peer id=5 addr=127.0.0.1:46280 laddr=127.0.0.1:5587 fd=8 name=5 tid=0 irqmatch=0 age=143 idle=0 tot-cmds=1 tot-net-in=124 tot-read-calls=2 tot-dispatches=2 db=0 flags=t phase=preclose lib-name=redis-py lib-ver=5.2.1

It is closed much later, well after the PING command is rejected. It seems something got stuck with that client after the "PING not executed" log.

The event mask 8197 decodes to EPOLLIN | EPOLLOUT | EPOLLRDHUP (1, 4, 8192), which triggers the error callback in util::fb2::EpollSocket::Wakey.
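
For reference, here is a small standalone C++ snippet (not Dragonfly or helio code) that decodes such a mask against the kernel's epoll flag values; for 8197 it prints EPOLLIN, EPOLLOUT and EPOLLRDHUP:

#include <sys/epoll.h>

#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t mask = 8197;  // 0x2005, the value seen in the "Got event" log
  const struct { uint32_t bit; const char* name; } kFlags[] = {
      {EPOLLIN, "EPOLLIN"},       {EPOLLOUT, "EPOLLOUT"},
      {EPOLLERR, "EPOLLERR"},     {EPOLLHUP, "EPOLLHUP"},
      {EPOLLRDHUP, "EPOLLRDHUP"}, {EPOLLPRI, "EPOLLPRI"},
  };
  for (const auto& f : kFlags) {
    if (mask & f.bit)
      std::printf("%s (0x%x)\n", f.name, f.bit);  // prints EPOLLIN (0x1), EPOLLOUT (0x4), EPOLLRDHUP (0x2000)
  }
  return 0;
}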

abhijat avatar Oct 13 '25 05:10 abhijat

https://github.com/dragonflydb/dragonfly/actions/runs/17064796321/job/48379371916 (for epoll)

In this case the error is during the final check_data call:

E               redis.exceptions.ConnectionError: Connection closed by server.

The server seems to have crashed:

5579➜    @           0x5b573b        272  dfly::MultiCommandSquasher::Execute()
5579➜    @           0x55a33b        128  dfly::Service::DispatchManyCommands()::{lambda()#1}::operator()()
5579➜    @           0x55aa79        480  dfly::Service::DispatchManyCommands()
5579➜    @          0x1351888        464  facade::Connection::SquashPipeline()
5579➜    @          0x1352a50        400  facade::Connection::AsyncFiber()
5579➜    @          0x135425b         32  facade::Connection::LaunchAsyncFiberIfNeeded()::{lambda()#1}::operator()()
5579➜    @          0x1359cd6         32  std::__invoke_impl<>()
5579➜    @          0x13599b9         32  std::__invoke<>()
5579➜    @          0x1359752         32  std::__apply_impl<>()
5579➜    @          0x135978c         48  std::apply<>()
5579➜    @          0x1359809         96  util::fb2::detail::WorkerFiberImpl<>::run_()
5579➜    @          0x1358e93         80  util::fb2::detail::WorkerFiberImpl<>::WorkerFiberImpl<>()::{lambda()#1}::operator()()
5579➜    @          0x135a5bc         80  std::__invoke_impl<>()
5579➜    @          0x135a3e5         80  std::__invoke<>()
5579➜    @          0x135a2ba         80  std::invoke<>()
5579➜    @          0x135a08c         80  boost::context::detail::fiber_record<>::run()
5579➜    @          0x1359dea         48  boost::context::detail::fiber_entry<>()
5579➜    @     0x7f4d00c781cf  (unknown)  make_fcontext
24816:M 19 Aug 2025 09:24:14.025 * Connection with replica 127.0.0.1:5579 lost.
E20250819 09:24:14.893441 24835 protocol_client.cc:194] Error while calling sock_->Connect(server_context_.endpoint): Software caused connection abort

It does not look like the core dump was captured.

abhijat avatar Oct 13 '25 06:10 abhijat

Logs older than https://github.com/dragonflydb/dragonfly/issues/4451#issuecomment-3200140396 are no longer accessible.

abhijat avatar Oct 13 '25 06:10 abhijat

I can reproduce a somewhat similar effect locally by rejecting each PING command and looking at how long Dragonfly takes to respond to each one:

I20251013 17:07:05.170480 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.176245 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.181468 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.186866 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.191576 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.196043 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.200026 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.204668 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.209335 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.214330 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.219556 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.224776 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
---
I20251013 17:07:16.369480 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:28.820564 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:41.290228 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:53.707533 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:08:06.038409 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:08:07.410672 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE

While the first several commands are handled within a fraction of a second of each other, at some point command processing becomes much slower, with gaps of roughly 10 seconds or more between PINGs.

If we try to correlate this with the main replication fiber:

...
I20251013 17:07:05.209335 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.214330 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.219556 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.224776 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:05.226022 268170 epoll_socket.cc:599] sock[9] Wakey: Schedule read in main_replication
I20251013 17:07:16.369480 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:28.820564 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:41.290228 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:07:53.707533 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:08:06.038409 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:08:07.410672 268170 main_service.cc:1301] Command PING not executed because global state is ACTIVE

The log line related to the main replication fiber appears soon after we transition into stable sync, in the loop where we read responses from the master:

I20251013 17:07:05.226022 268170 epoll_socket.cc:599] sock[9] Wakey: Schedule read in main_replication
I20251013 17:07:05.226032 268170 replica.cc:719] Got command SELECT
I20251013 17:07:05.241410 268170 replica.cc:719] Got command SETRANGE
I20251013 17:07:05.256443 268170 replica.cc:719] Got command SADD
I20251013 17:07:05.271334 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.286151 268170 replica.cc:719] Got command LPUSH
I20251013 17:07:05.301239 268170 replica.cc:719] Got command SADD
I20251013 17:07:05.316139 268170 replica.cc:719] Got command LPUSH
I20251013 17:07:05.330931 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.345654 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.360827 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.375752 268170 replica.cc:719] Got command ZPOPMIN
I20251013 17:07:05.390681 268170 replica.cc:719] Got command LPOP
I20251013 17:07:05.405429 268170 replica.cc:719] Got command LPOP
I20251013 17:07:05.420665 268170 replica.cc:719] Got command ZADD
I20251013 17:07:05.436010 268170 replica.cc:719] Got command ZADD
I20251013 17:07:05.451520 268170 replica.cc:719] Got command ZADD
I20251013 17:07:05.466920 268170 replica.cc:719] Got command HINCRBY
I20251013 17:07:05.481746 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.496531 268170 replica.cc:719] Got command LPUSH
I20251013 17:07:05.511369 268170 replica.cc:719] Got command ZADD
I20251013 17:07:05.526353 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.541150 268170 replica.cc:719] Got command APPEND
I20251013 17:07:05.556195 268170 replica.cc:719] Got command LPOP
I20251013 17:07:05.571135 268170 replica.cc:719] Got command SADD
...
...

It seems that once we start processing commands in the stable sync phase, the connection fiber that responds to PING commands rarely gets to run, because the test uses proactor_threads=1.
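
To make that concrete, here is a small standalone illustration, using Boost.Fiber as a stand-in for helio's fb2 fibers (so the scheduler and names differ from Dragonfly's): on a single thread with a cooperative scheduler, a fiber that never yields keeps its sibling from running until it finishes.

#include <boost/fiber/all.hpp>

#include <cstdio>

int main() {
  // Plays the role of the connection fiber answering PING.
  boost::fibers::fiber ping([] {
    for (int i = 0; i < 3; ++i) {
      std::printf("ping handled\n");
      boost::this_fiber::yield();
    }
  });

  // Plays the role of the replication fiber draining a constant stream of
  // commands: it never yields, so "ping" is starved until the loop ends.
  boost::fibers::fiber replication([] {
    volatile long sink = 0;
    for (long i = 0; i < 50'000'000; ++i) {
      sink += i;
      // boost::this_fiber::yield();  // a periodic yield here lets "ping" run promptly
    }
    std::printf("replication loop done\n");
  });

  replication.join();
  ping.join();
  return 0;
}

With the yield commented out, "ping" gets at most one turn before the long loop completes; uncommenting it (or yielding every Nth iteration, as in the patch in the next comment) interleaves the two fibers.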

abhijat avatar Oct 13 '25 11:10 abhijat

By naively yielding the fiber every thousand received commands, as in the following local patch:

diff --git a/src/server/replica.cc b/src/server/replica.cc
index c812f0da..4ea82bfa 100644
--- a/src/server/replica.cc
+++ b/src/server/replica.cc
@@ -697,6 +697,7 @@ error_code Replica::ConsumeRedisStream() {

   acks_fb_ = fb2::Fiber("redis_acks", &Replica::RedisStreamAcksFb, this);

+  size_t x = 0;
   while (true) {
     auto response = ReadRespReply(&io_buf, /*copy_msg=*/false);
     if (!response.has_value()) {
@@ -716,7 +717,7 @@ error_code Replica::ConsumeRedisStream() {
       // Dragonfly disallows some commands, such as SELECT, inside of MULTI/EXEC, so here we simply
       // ignore MULTI/EXEC and execute their inner commands individually.
       if (!absl::EqualsIgnoreCase(cmd, "MULTI") && !absl::EqualsIgnoreCase(cmd, "EXEC")) {
-        VLOG(2) << "Got command " << cmd << "\n consumed: " << response->total_read;
+        VLOG_EVERY_N(2, 10000) << "Got command " << cmd;

         if (LastResponseArgs()[0].GetBuf()[0] == '\r') {
           for (const auto& arg : LastResponseArgs()) {
@@ -727,6 +728,11 @@ error_code Replica::ConsumeRedisStream() {
         facade::RespExpr::VecToArgList(LastResponseArgs(), &args_vector);
         CmdArgList arg_list{args_vector.data(), args_vector.size()};
         service_.DispatchCommand(arg_list, &null_builder, &conn_context);
+
+        if (++x % 1000 == 0) {
+          ThisFiber::Yield();
+          x = 0;
+        }
       }
     }

The PING handling becomes much more responsive:

I20251013 17:20:26.324806 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:26.438584 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:26.551772 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:26.664197 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:26.777375 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:26.889222 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.001827 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.114288 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.229979 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.343152 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.456777 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.571981 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.684424 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.798259 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:27.911111 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.025056 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.135571 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.246429 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.356876 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.467079 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.578392 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.689047 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE
I20251013 17:20:28.800676 276633 main_service.cc:1301] Command PING not executed because global state is ACTIVE

The responses now appear fairly quickly

abhijat avatar Oct 13 '25 11:10 abhijat

I'm not quite clear yet on why this explicit yield is needed for faster responses, though. Since reading commands reads from the socket, shouldn't the fiber implicitly yield once in a while when no data is available on the socket?

The replication fiber is only woken up once after stable sync (as per the logs), though, and it looks like it proceeds to read all the commands after that.

abhijat avatar Oct 13 '25 11:10 abhijat

@abhijat I think you are observing the different behaviour of the two socket implementations:

For iouring, in RecvMsg:

  while (true) {
    FiberCall fc(p, timeout());
    fc->PrepRecvMsg(fd, &msg, flags);
    fc->sqe()->flags |= register_flag();

    // As described in "io_uring_prep_recv(3)"
    if (has_pollfirst_ && !has_recv_data_) {
      fc->sqe()->ioprio |= IORING_RECVSEND_POLL_FIRST;
    }

    res = fc.Get();

We suspend the fiber unconditionally on each call.

For epoll this doesn't seem to be true (I could be wrong):

auto EpollSocket::RecvMsg(const msghdr& msg, int flags) -> Result<size_t> {
  // ...
  do {
    if (fd_ & IS_SHUTDOWN) {
      ec = make_error_code(errc::connection_aborted);
      break;
    }

    ssize_t res = recvmsg(fd, const_cast<msghdr*>(&msg), flags);
    if (res > 0) {  // if res is 0, that means a peer closed the socket.
      return res;
    }

    if (res == 0) {
      ec = make_error_code(errc::connection_aborted);
      break;
    }
    // ...
    PendingReq req(&read_req_);
    ec = req.Suspend(timeout());

recvmsg returns immediately when data is available, without any suspension; the fiber only suspends when there is no data to read.
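
A tiny standalone program (not helio code) shows the underlying syscall behaviour the epoll path inherits: a non-blocking recv returns buffered data immediately, and only reports EAGAIN when the buffer is empty, which is the only point at which the epoll socket would suspend the fiber.

#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>

int main() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
    std::perror("socketpair");
    return 1;
  }
  fcntl(sv[0], F_SETFL, O_NONBLOCK);

  if (write(sv[1], "xyz", 3) != 3)  // data is now buffered on sv[0]
    return 1;

  char buf[16];
  ssize_t n = recv(sv[0], buf, sizeof buf, 0);
  std::printf("first recv: %zd bytes, returned immediately\n", n);

  n = recv(sv[0], buf, sizeof buf, 0);
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
    std::printf("second recv: no data, this is where the fiber would suspend\n");

  close(sv[0]);
  close(sv[1]);
  return 0;
}

As long as the master keeps the replication socket full, the first case repeats forever and the read loop never gives the PING connection fiber a turn.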

kostasrim avatar Oct 13 '25 15:10 kostasrim

@kostasrim this seems to be pretty much what is happening. I added a log in the res > 0 branch of the epoll socket, and that log keeps printing from the start of the stable sync state until the test ends, so the recv call chain never lets the other fiber run.

On the other hand, in the same test with the uring socket, the PING command is able to run (and be rejected) much more often even during the stable sync phase, because that socket suspends regularly.

abhijat avatar Oct 14 '25 06:10 abhijat

Should we yield or suspend in the epoll socket every once in a while, just like the uring socket, then? It doesn't make sense for one fiber to monopolize the shard even if there is a constant stream of data to read.

abhijat avatar Oct 14 '25 06:10 abhijat

Should we yield or suspend in the epoll socket every once in a while, just like the uring socket, then? It doesn't make sense for one fiber to monopolize the shard even if there is a constant stream of data to read.

Yes, I think the solution is to patch helio and the epoll socket. In fact, I think we should yield every time, just like we do with iouring, even if there is data available to be read, so that we are "cooperating". Otherwise, you can't get cooperative multitasking without cooperating 🤣
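
A minimal sketch of that idea, expressed against the RecvMsg excerpt quoted above rather than as an actual helio patch (and assuming a ThisFiber::Yield()-style call is usable there):

     ssize_t res = recvmsg(fd, const_cast<msghdr*>(&msg), flags);
     if (res > 0) {  // if res is 0, that means a peer closed the socket.
+      // Cooperate even when data keeps arriving, so a fiber draining a
+      // constant stream does not monopolize the proactor thread.
+      ThisFiber::Yield();
       return res;
     }

Yielding on every successful read roughly matches what the iouring path already does by suspending on each call; if the overhead turns out to matter, yielding every Nth successful read (as in the replica.cc experiment above) would be the more conservative variant.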

kostasrim avatar Oct 14 '25 06:10 kostasrim

This issue does not affect release mode; it's only seen in debug mode.

abhijat avatar Oct 15 '25 12:10 abhijat