[2.x] Influx replication (OSS 2.4) stops working but still returns 204 status codes
Steps to reproduce:
- Installed Influx OSS 2.4 on local edge computer and cloud enterprise.
- Set up remote and replication to cloud
Expected behavior: Local Influx to replicate to cloud.
Actual behavior:
-
Local influx replicates to cloud for a while (sometimes even days). For an unknown reason it will stop replicating but still show status code 204 when viewing the "influx replication list" on the CLI.
-
Current Queue bytes continues to grow indefinitely.
-
It would be helpful if there was a way to restart the replication without having to delete and create it again.
Environment info: SYSTEM INFO OS Name: Microsoft Windows 10 Pro OS Version: 10.0.19044 N/A Build 19044 System Model: HP EliteDesk 800 G2 DM 35W System Type: x64-based PC Processor(s): 1 Processor(s) Installed. [01]: Intel64 Family 6 Model 94 Stepping 3 GenuineIntel ~2496 Mhz BIOS Version: HP N21 Ver. 02.22, 3/27/2017 System Locale: en-us;English (United States) Input Locale: N/A Time Zone: (UTC-06:00) Central Time (US & Canada) Total Physical Memory: 15,785 MB Available Physical Memory: 11,949 MB Virtual Memory: Max Size: 18,217 MB Virtual Memory: Available: 14,386 MB Virtual Memory: In Use: 3,831 MB
INFLUX VERSION C:\windows\system32>influxd version InfluxDB v2.4.0 (git: de247bab08) build_date: 2022-08-18T19:41:41Z
A bit of findings.
It turns out the replication doesn't stop working, it just falls behind significantly. (data is written to the buffer faster than it's replicated up to the cloud). So it creates a substantial lag between OSS and cloud Influx. (approximately 6h lag in the last 24 hours).
Bandwidth shouldn't be the issue since I'm writing three tags every 10 seconds.
Any updates on this? I am having the same problems, are there workarounds existing?
Same, I asked about this last week: https://community.influxdata.com/t/replication-stream-throttled/30553
In my case, I have a 5 days delay.
This is my current replication directory:
root@a83ab38d7827:/var/lib/influxdb2/engine/replicationq/0b57d84662e23000# ls -lah
total 322M
drwxr-xr-x 2 influxdb influxdb 4.0K Jul 12 14:51 .
drwxr-xr-x 3 influxdb root 4.0K Jun 12 11:35 ..
-rw------- 1 influxdb influxdb 11M Jul 7 10:53 10
-rw------- 1 influxdb influxdb 11M Jul 7 11:03 11
-rw------- 1 influxdb influxdb 11M Jul 7 11:12 12
-rw------- 1 influxdb influxdb 11M Jul 7 11:29 13
-rw------- 1 influxdb influxdb 11M Jul 7 12:07 14
-rw------- 1 influxdb influxdb 11M Jul 7 12:47 15
-rw------- 1 influxdb influxdb 11M Jul 7 13:34 16
-rw------- 1 influxdb influxdb 11M Jul 7 13:54 17
-rw------- 1 influxdb influxdb 11M Jul 10 10:00 18
-rw------- 1 influxdb influxdb 11M Jul 10 10:11 19
-rw------- 1 influxdb influxdb 11M Jul 10 10:21 20
-rw------- 1 influxdb influxdb 11M Jul 10 10:32 21
-rw------- 1 influxdb influxdb 11M Jul 10 10:43 22
-rw------- 1 influxdb influxdb 11M Jul 10 10:54 23
-rw------- 1 influxdb influxdb 11M Jul 10 11:05 24
-rw------- 1 influxdb influxdb 11M Jul 10 11:16 25
-rw------- 1 influxdb influxdb 11M Jul 10 11:27 26
-rw------- 1 influxdb influxdb 11M Jul 10 11:39 27
-rw------- 1 influxdb influxdb 11M Jul 10 13:03 28
-rw------- 1 influxdb influxdb 11M Jul 10 15:16 29
-rw------- 1 influxdb influxdb 11M Jul 10 18:16 30
-rw------- 1 influxdb influxdb 11M Jul 11 09:36 31
-rw------- 1 influxdb influxdb 11M Jul 11 10:19 32
-rw------- 1 influxdb influxdb 11M Jul 11 10:30 33
-rw------- 1 influxdb influxdb 11M Jul 11 10:41 34
-rw------- 1 influxdb influxdb 11M Jul 11 10:52 35
-rw------- 1 influxdb influxdb 11M Jul 11 11:03 36
-rw------- 1 influxdb influxdb 11M Jul 11 11:15 37
-rw------- 1 influxdb influxdb 11M Jul 11 12:04 38
-rw------- 1 influxdb influxdb 11M Jul 12 09:47 39
-rw------- 1 influxdb influxdb 11M Jul 12 14:51 40
-rw------- 1 influxdb influxdb 1.3M Jul 12 15:05 41
-rw------- 1 influxdb influxdb 11M Jul 12 15:05 9
Even for me its stuck there but i see 203 status code in influx... this has happened when one of the replication machine was down for some time... FYI: i'm running replication between 2 systems kind of bidirectional..
this is my replication info:
# influx replication list
ID Name Org ID Remote ID Local Bucket ID Remote Bucket ID Remote Bucket Name Remaining Bytes to be Synced Current Queue Bytes on Disk Max Queue Bytes Latest Status Code Drop Non-Retryable Data
0c5ff2a10dd6b000 replication1 6e7eedec8b5eb87e 0c5ff2a0efacd000 6984d18380ffab41 6984d18380ffab41 4799248505 4801750271 30000000000 204 false
Have you tried 2.7/2.latest?
If you are an enterprise customer, please ask through your support contact. @aortuno-liberty
We are running 2.7.1 for months and are experiencing this issue.
I see minor changes between 2.7.1 and latest related to the replication. I can try to upgrade but I´m quite positive it's going to be the same behavior.