Bulk line ending issue upgrading from tile38 1.29.1 to 1.33.4
Describe the bug
We upgraded from Tile38 1.29.1 to 1.33.4 by adding 1.33.4 nodes to the existing cluster and removing the old nodes one at a time, making sure each node was caught up before removing an old one. To replace the master we triggered a failover from sentinel and then removed the old master. About two hours after the nodes were stable and caught up, we noticed that latency had increased significantly and saw the error "follow: Protocol error: invalid bulk line ending" on the follower nodes. We had to bring the follower nodes down to stabilize the cluster, and every time we try to bring follower nodes back up we get the same error. If I copy the AOF to a new cluster and add followers there, the error does not appear. Are there any steps we can take to allow following on the existing cluster?
To Reproduce
Steps to reproduce the behavior:
- Have a 1.29.1 cluster
- Add the 1.33.4 nodes, making sure they're caught up
- Remove the old 1.29.1 followers
- Fail over the master to a new node and remove the old master node
- Confirm everything is caught up
- Two hours later the errors start
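For reference, the upgrade flow amounts to roughly the following; hosts, ports, and the sentinel master name are placeholders for our environment.

```bash
OLD_MASTER=10.0.0.1           # current 1.29.1 master (placeholder)
NEW_NODE=10.0.0.4             # freshly provisioned 1.33.4 node (placeholder)
SENTINEL=10.0.0.10            # any sentinel node (placeholder)
MASTER_NAME=tile38-primary    # sentinel master name (placeholder)

# 1. Point each new 1.33.4 node at the current master and wait for it to catch up.
tile38-cli -h "$NEW_NODE" -p 9851 FOLLOW "$OLD_MASTER" 9851
tile38-cli -h "$NEW_NODE" -p 9851 SERVER      # watch for caught_up:true

# 2. Once all new nodes are caught up, remove the old 1.29.1 followers,
#    then ask sentinel to fail the master over to one of the new nodes.
redis-cli -h "$SENTINEL" -p 26379 SENTINEL failover "$MASTER_NAME"

# 3. Confirm which node was promoted, then decommission the old master.
redis-cli -h "$SENTINEL" -p 26379 SENTINEL get-master-addr-by-name "$MASTER_NAME"
```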
Expected behavior
When follower nodes are added they should connect to the master and catch up.
Logs
Found more details on this issue: it seems the problem arises after version 1.32.2. We have had to revert to that version while we look for a way to upgrade further.
To be clear, going from 1.32.2 to 1.32.3+ breaks your cluster? What OS are you using?
Essentially, when the version is 1.33.0, new nodes added to the cluster fail to catch up and keep hitting the bulk line ending error. Note this is a cluster whose leader is managed with sentinel nodes. We are using AL2 from AWS on the arm64 architecture.
In addition, when a node is master with v1.33.0 and a sentinel failover is triggered, the failover doesn't go smoothly in a live cluster and sometimes requires a restart of one of the nodes; either the followers or the master hang. We had to revert to 1.32.2. It looks like the protobuf and net libraries were updated in 1.33.0.
Thanks for the additional context. That's very helpful. I took a look at the diff between 1.32.2 and 1.33 and didn't find any obvious change that would cause a break like that. Some dependencies were updated, as you mentioned, and the workflow file was updated to use the latest Go version, 1.22. I don't recall encountering problems when upgrading those dependencies before.
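For anyone who wants to look themselves, the dependency changes can be inspected with something like the following; the tag names are an assumption and may differ in your checkout.

```bash
# In a tile38 checkout, show the go.mod/go.sum changes between the two releases.
# The tag names below are assumed to match the version numbers.
git diff 1.32.2 1.33.0 -- go.mod go.sum
```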
I'll try to reproduce the issue on my side.
Are you using the Docker builds?
Not using Docker builds. The way we fixed the bulk line ending issue and the failover issue was to downgrade in the following way (we did it during a maintenance window):
- Add a follower node running version 1.32.2
- Copy the AOF file from the 1.33+ node to the new node and restart the tile38 service; without this step it wouldn't catch up and kept producing a ton of bulk line ending and other protocol/parsing errors
- Once the new node is caught up, fail over using sentinel
- After the failover, the previous 1.33+ master node becomes unresponsive/stuck and the new node becomes master
- Add another 1.32.2 node, which catches up without any errors
- Remove the 1.33+ node
Note: I also went through the AOF format and did an AOF validity check; our AOF was passing the check.
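In command form, the workaround looked roughly like this; the hosts, data directory, port, and service/master names below are placeholders for our setup.

```bash
NEW_NODE=10.0.0.5              # fresh node running tile38 1.32.2 (placeholder)
SOURCE_NODE=10.0.0.4           # existing 1.33+ node to take the AOF from (placeholder)
CURRENT_MASTER=10.0.0.2        # current master (placeholder)
SENTINEL=10.0.0.10             # any sentinel node (placeholder)
MASTER_NAME=tile38-primary     # sentinel master name (placeholder)
DATA_DIR=/var/lib/tile38/data  # tile38 data directory (placeholder)

# Seed the new 1.32.2 follower with the AOF from the 1.33+ node, then restart it.
ssh "$NEW_NODE" sudo systemctl stop tile38
scp "$SOURCE_NODE:$DATA_DIR/appendonly.aof" /tmp/appendonly.aof
scp /tmp/appendonly.aof "$NEW_NODE:$DATA_DIR/appendonly.aof"
ssh "$NEW_NODE" sudo systemctl start tile38

# Point it at the current master and wait until it reports caught up,
# then fail over with sentinel and retire the 1.33+ nodes.
tile38-cli -h "$NEW_NODE" -p 9851 FOLLOW "$CURRENT_MASTER" 9851
redis-cli -h "$SENTINEL" -p 26379 SENTINEL failover "$MASTER_NAME"
```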
Unfortunately I still haven't been able to reproduce the issue. I tried upgrading and downgrading between 1.32, 1.33, and 1.34 as described above. I also tried other things that I thought might cause the issue, but to no avail.
Hi, during our experiments (Samar's and mine; we're at the same company) I created the following test scripts, which download and install the specified version of tile38. The idea was to iterate over each version and find where the error message starts popping up.
I am attaching the zip file here.
For the experiment, I installed redis on the machine (to get redis-sentinel) and disabled autostart of the redis systemd services.
HashiCorp Consul is also needed for the first-time leader election logic in the scripts (start it in another shell using consul agent -dev).
You can skip the Consul bits if you manually assign the leader and followers.
In brief, the experiment is to run start_tile38_server.bash with different ports (10001, 10002, 10003) on the same machine,
run start_sentinel.bash <1|2|3> with the sentinel id as a parameter,
then start tile38-benchmark in another shell.
Observe the outputs of all three tile38-server processes.
Then observe, when you press CTRL+C on the current leader, how the sentinels switch the leader and how the error messages pop up.
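In command form the setup looks roughly like this (the scripts in the zip wrap these steps; the benchmark target port assumes the leader ended up on 10001):

```bash
# Terminal 0: Consul, used only for the scripts' first-time leader election.
consul agent -dev

# Terminals 1-3: one tile38 server per port.
./start_tile38_server.bash 10001
./start_tile38_server.bash 10002
./start_tile38_server.bash 10003

# Terminals 4-6: one sentinel per id.
./start_sentinel.bash 1
./start_sentinel.bash 2
./start_sentinel.bash 3

# Another terminal: generate load against the current leader
# (assumed to be the server on port 10001 here).
tile38-benchmark -p 10001

# Then press CTRL+C on the current leader and watch the sentinels promote a
# new leader while the "invalid bulk line ending" errors show up on the others.
```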
On this single-machine test setup, versions 1.33.1 (one thirty-three one) and onwards produced the error quite predictably in the logs:
   _____ _ _     ___ ___
  |_   _|_| |___|_  | . |   Tile38 1.33.1 (4a1c800b) 64 bit (amd64/linux)
    | | | | | -_|_  | . |   Port: 10002, PID: 108999
    |_| |_|_|___|___|___|   tile38.com
2024/12/13 23:22:06 [INFO] Server started, Tile38 version 1.33.1, git 4a1c800b
2024/12/13 23:22:06 [DEBU] Geom indexing: QuadTree (64 points)
2024/12/13 23:22:06 [DEBU] Multi indexing: RTree (64 points)
2024/12/13 23:22:06 [INFO] Ready to accept connections at [::]:10002
2024/12/13 23:22:06 [INFO] AOF loaded 0 commands: 0.00s, 0/s, 0 bytes/s
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60224
2024/12/13 23:22:06 [DEBU] Detached connection: 127.0.0.1:60224
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60216
2024/12/13 23:22:06 [INFO] live 127.0.0.1:60224
2024/12/13 23:22:06 [DEBU] pubsub open
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60238
2024/12/13 23:22:06 [DEBU] Detached connection: 127.0.0.1:60238
2024/12/13 23:22:06 [INFO] live 127.0.0.1:60238
2024/12/13 23:22:06 [DEBU] pubsub open
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60232
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60254
2024/12/13 23:22:06 [DEBU] Opened connection: 127.0.0.1:60264
2024/12/13 23:22:06 [DEBU] Detached connection: 127.0.0.1:60264
2024/12/13 23:22:06 [INFO] live 127.0.0.1:60264
2024/12/13 23:22:06 [DEBU] pubsub open
2024/12/13 23:22:16 [INFO] following new host '127.0.0.1' '10001'.
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:check some
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:replconf
2024/12/13 23:22:16 [INFO] following new host '127.0.0.1' '10001'.
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:read aof
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:check some
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:replconf
2024/12/13 23:22:16 [DEBU] follow:127.0.0.1:10001:read aof
2024/12/13 23:22:17 [ERRO] follow: Protocol error: invalid bulk line ending
2024/12/13 23:22:18 [DEBU] follow:127.0.0.1:10001:check some
2024/12/13 23:22:18 [DEBU] follow: aof fully intact
2024/12/13 23:22:18 [DEBU] follow:127.0.0.1:10001:replconf
2024/12/13 23:22:18 [DEBU] follow:127.0.0.1:10001:read aof
2024/12/13 23:22:20 [ERRO] follow: Protocol error: invalid bulk line ending
2024/12/13 23:22:21 [DEBU] follow:127.0.0.1:10001:check some
2024/12/13 23:22:21 [DEBU] follow: aof fully intact
2024/12/13 23:22:21 [DEBU] follow:127.0.0.1:10001:replconf
2024/12/13 23:22:21 [DEBU] follow:127.0.0.1:10001:read aof
2024/12/13 23:22:24 [INFO] caught up
Hi Shantanu,
After a little tweaking I was able to get the test scripts to run.
- I don't run Consul, so I manually handle the initial follow commands.
- I had to use 1.33.2. Tile38 1.33.1 doesn't work with start_sentinel.bash because there was a bug in that version where the role json response was bad. This was fixed in 1.33.2: https://github.com/tidwall/tile38/commit/100c4dd837e6c8848a69443bf2a8246adeeec6b0
Once I got the cluster running I started pumping data into it, then killed the leader and let the sentinel switch to a new leader. I then began pumping data to the new leader, brought the previously killed server back up, and let the sentinel restore the cluster to a three-node system, all while data was still being pushed to the new leader.
I did get some "Protocol error: invalid bulk line ending" errors like shown, but the follower remedied the issue and finally caught up.
The output you show has the "[INFO] caught up" in the end too, which should mean that the follower is fixed and ready to use.
This isn't exactly the same issue that Samar described.
In Samar's output there are additional errors such as "follow: invalid argument 'obje*3'". It looks like in that case there may have been corruption in network transit or on disk, but it's difficult to tell. I also noticed "aof shrink ended" and "reloading aof commands". These indicate that the follower may have run an aofshrink before restarting the process.
I tried the same: ran AOFSHRINK, then killed the follower and restarted it. And sure enough I got some parsing errors:
2025/01/23 14:08:04 [WARN] truncating aof to 27181600
2025/01/23 14:08:04 [INFO] reloading aof commands
2025/01/23 14:08:05 [INFO] AOF loaded 313446 commands: 0.57s, 553341/s, 46 MB/s
2025/01/23 14:08:05 [DEBU] follow:127.0.0.1:10003:replconf
2025/01/23 14:08:05 [DEBU] follow:127.0.0.1:10003:read aof
2025/01/23 14:08:06 [ERRO] follow: strconv.ParseInt: parsing "*3": invalid syntax
2025/01/23 14:08:07 [DEBU] follow:127.0.0.1:10003:check some
2025/01/23 14:08:07 [DEBU] follow: aof fully intact
2025/01/23 14:08:07 [DEBU] follow:127.0.0.1:10003:replconf
2025/01/23 14:08:07 [DEBU] follow:127.0.0.1:10003:read aof
2025/01/23 14:08:07 [ERRO] follow: strconv.ParseInt: parsing "9\r*3": invalid syntax
That went on for 22 seconds until:
2025/01/23 14:08:29 [DEBU] follow: aof fully intact
2025/01/23 14:08:29 [DEBU] follow:127.0.0.1:10003:replconf
2025/01/23 14:08:29 [DEBU] follow:127.0.0.1:10003:read aof
2025/01/23 14:08:29 [INFO] caught up
The follower ended up healing itself and the cluster is back up again.
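For reference, that repro boils down to something like the following; the follower's port and the restart command are assumptions (the leader in my run was on 10003).

```bash
# Shrink the follower's AOF, then kill and restart the follower process.
# The follower is assumed to be the server on port 10002 here.
tile38-cli -p 10002 AOFSHRINK

pkill -f 'tile38-server.*10002'        # stop the follower (pattern is a placeholder)
./start_tile38_server.bash 10002       # bring it back up and let it re-follow

# Watch the logs (or poll SERVER) until the follower reports it has caught up.
tile38-cli -p 10002 SERVER
```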
Did the follower in the original issue, as described by Samar, ever get "caught up"? And if so, how long did it take? If not, did the parse error messages just continue for a very long time before killing the process?
> I did get some "Protocol error: invalid bulk line ending" errors like shown, but the follower remedied the issue and finally caught up.
> The output you show has the "[INFO] caught up" in the end too, which should mean that the follower is fixed and ready to use.
> This isn't exactly the same issue that Samar described.
Thanks for clarifying!
> Did the follower in the original issue, as described by Samar, ever get "caught up"? And if so, how long did it take? If not, did the parse error messages just continue for a very long time before killing the process?
I think we can discuss internally and post replies here for further investigation.
@tidwall the follower did catch up in some cases, but other times it would keep loading the AOF perpetually: it would pull the AOF file, hit shrink issues, and then start again from scratch. I noticed that the AOF file would go from 1 GB to almost complete (~6-8 GB) and then back to 1 GB, with the cycle taking anywhere from minutes up to hours. If I left it running overnight, some of the added followers would eventually join, but it's not guaranteed.
Also, there are really two issues here:
- the follower taking a long time to follow
- if a new node is added and a failover is performed, the 1.33+ node that was previously the master becomes unresponsive for a while, long enough to cause errors on our client side; this goes on for up to 30 minutes, and usually I have to restart the 1.33+ tile38 service for it to become available for requests again
I think you guys have seen the first issue; let me recreate the second and describe it here step by step.