Starlink wan gradually degrades to extremely low bandwidth when connected to OMR, only recoverable by resetting interface

Open ioogithub opened this issue 2 years ago • 13 comments

Expected Behavior

Starlink wan will remain available and operate with normal bandwidth when connected to OMR.

Current Behavior

When Starlink is connected to OMR (via wan1), the available bandwidth seems to degrade over time and eventually becomes very low (0.01Mb). The only way to recover from this state is to reset the wan interface. After the reset, the bandwidth is normal again (30-50Mb).

When Starlink is in this state, the connection still shows as normal on the System->OpenMPTCProuter->Status page (green checkmark).

Possible Solution

Temporary solution is to restart the interfaces (Network->Interfaces->Restart) but this is not sustainable. I need to find the root cause.

Steps to Reproduce the Problem

  1. Start OpenMPTCProuter
  2. Observe all green on the status page
  3. Use Internet normally
  4. After a period of time (maybe a few hours), observe that starlink connection is slowing down.
  5. Run omr-test-speed and observe extreme low bandwidth on starlink wan.

Before interface reset (0.01Mb):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1656k    0 1656k    0     0  **10984**      0 --:--:--  0:02:34 --:--:-- 14882

After interface reset (30Mb):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  3 9536M    3  356M    0     0  2925k      0  0:55:37  0:02:04  0:53:33 2870k
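To compare the two curl runs in the same units, here is a minimal sketch that converts the "Average Speed / Dload" column to Mbit/s. It assumes curl reports bytes per second and that the `k`/`M` suffixes are 1024-based; the function name `curl_speed_to_mbit` is made up for this illustration.

```python
def curl_speed_to_mbit(value: str) -> float:
    """Convert a curl progress-meter speed (bytes/s, optional k/M/G suffix) to Mbit/s."""
    # Assumption: curl's suffixes are powers of 1024.
    suffixes = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = value.strip()
    multiplier = 1
    if value and value[-1] in suffixes:
        multiplier = suffixes[value[-1]]
        value = value[:-1]
    bytes_per_second = float(value) * multiplier
    return bytes_per_second * 8 / 1_000_000


before = curl_speed_to_mbit("10984")   # Dload before the interface reset
after = curl_speed_to_mbit("2925k")    # Dload after the interface reset

print(f"before reset: {before:.3f} Mbit/s")  # 0.088 Mbit/s
print(f"after reset:  {after:.2f} Mbit/s")   # 23.96 Mbit/s
```

Under these assumptions, the "0.01" figure above corresponds to roughly 0.01 megabytes per second (about 0.09 Mbit/s), and the post-reset run is about 24 Mbit/s, a difference of more than two orders of magnitude.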

Context (Environment)

This was hard to track down, but I have observed it several times a day for the past few days. I do not see this speed degradation when I connect Starlink to a mesh router (also OpenWRT) or directly to a PC.

Specifications

OpenMPTCProuter version: openmptcprouter v0.59.1-5.4 r0+16594-ce92d
OpenMPTCProuter VPS version: wget -O - https://www.openmptcprouter.com/server/debian-x86_64.sh | sh
OpenMPTCProuter VPS provider: linode
OpenMPTCProuter platform: RPI4B
Settings: almost all settings are default.

Log file

starlink interface was manually restarted here: Sun Sep 18 15:36:42 2022 daemon.notice netifd: Interface 'wan1' is now down

Looking at the log I see this message often, is this normal or a problem?: Sun Sep 18 14:54:46 2022 user.notice post-tracking-post-tracking: wan1 (eth1) switched off because check error and ping from wanip error (114.114.114.114,1.1.1.1,4.2.2.1)
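One way to judge how often this message appears is to count the "switched off" entries per interface in a saved copy of the system log. The sketch below is an illustration only: the regular expression is inferred from the single log line quoted above, and `count_switch_offs` is a hypothetical helper, not part of OpenMPTCProuter.

```python
import re
from collections import Counter

# Pattern inferred from the one post-tracking line quoted in this issue.
SWITCH_OFF = re.compile(
    r"post-tracking-post-tracking: (?P<iface>\S+) \((?P<dev>[^)]+)\) switched off"
)


def count_switch_offs(log_text: str) -> Counter:
    """Count 'switched off' tracker events per logical interface (e.g. wan1)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SWITCH_OFF.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


# The two log lines quoted in this issue, used as sample input.
sample = (
    "Sun Sep 18 14:54:46 2022 user.notice post-tracking-post-tracking: wan1 (eth1) "
    "switched off because check error and ping from wanip error "
    "(114.114.114.114,1.1.1.1,4.2.2.1)\n"
    "Sun Sep 18 15:36:42 2022 daemon.notice netifd: Interface 'wan1' is now down\n"
)

print(count_switch_offs(sample))
```

Running it over the full log and comparing the counts for wan1 and wan2 would show whether the tracker is failing mostly on the Starlink interface.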

Full Log file: https://privatebin.net/?dc3b06c2531f2c68#EiTvACMjsBYBLFL1Q7WEbNnh7Qfk8KLLEgcdNhWsBx3i

ioogithub avatar Sep 18 '22 17:09 ioogithub

I don't have the same problem with Starlink (average speed of 10 MB/s here, i.e. 80 Mb/s) on a RPI4B with the Starlink router. Check your network configuration and hardware.

Ysurac avatar Sep 18 '22 17:09 Ysurac

I don't have the same problem with Starlink (average speed of 10 MB/s here, i.e. 80 Mb/s) on a RPI4B with the Starlink router. Check your network configuration and hardware.

All the hardware is new (1 week old). No errors on the Starlink admin page (192.168.100.1), no obstructions. What configuration do you suggest I look at to troubleshoot? I can try a fresh OMR image to rule out configuration problems.

I don't see the problem when I connect without OMR:

  • PC->Starlink ethernet adapter->dish
  • Mesh router (Openwrt)->Starlink ethernet adapter->dish

Are there any other logs or statistics I can look at to troubleshoot? Did you notice any problem in the log I attached?

It seems that after I reset the interface the connection is fine for a while and then slowly degrades. Does Starlink require a TTL, or is there a buffer that is filling up?

ioogithub avatar Sep 18 '22 17:09 ioogithub

Do you use any USB to Ethernet adapter on the RPI? Or do you use a switch and macvlan interfaces?

Ysurac avatar Sep 18 '22 17:09 Ysurac

Do you use any USB to Ethernet adapter on the RPI? Or do you use a switch and macvlan interfaces?

Yes, I have two TP-Link UE300 adapters connected to two USB 3.0 ports on the Pi. I did some iperf testing and pushed data through them for 24h before deploying them; they seemed very stable.

I just switched them because the other wan2 connection doesn't show the same degradation. Do you suspect the adapters themselves or something on the pi USB bus?

ioogithub avatar Sep 18 '22 17:09 ioogithub

Is DNS potentially an issue? Have you tried pinging something like google.com when it slows down to evaluate DNS performance, as compared to pinging 8.8.8.8 from OMR and from a connected local client if available? Is your DNS set up as expected? We've resolved a lot of our issues by installing tcpdump on the VPS and looking at packet captures on both sides (OMR and VPS), which often leads to identifying a failure to receive or transmit packets and gives us guidance on where to dig.
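As a rough way to separate name resolution from raw connectivity, the following sketch times a DNS lookup on its own. It is a minimal illustration, assuming Python is available on a LAN client (it is not necessarily installed on OMR); `time_dns_lookup` is a hypothetical helper, and the resolver is injectable so the function can be exercised without network access.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Return the time in milliseconds to resolve hostname, or None on failure."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return (time.perf_counter() - start) * 1000.0


# Example usage on a machine behind OMR (performs a real lookup):
# print(time_dns_lookup("google.com"))
```

If lookups take far longer while the link is degraded but a ping to 8.8.8.8 stays normal, that would point toward DNS rather than the Starlink link itself.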

Our Starlink connection is Ethernet to Ethernet, so we don't have experience with Pi USB; however, we have seen many posts on the internet about some Pi USB Ethernet hardware not behaving exactly like a 1000BASE-T NIC. When we started our testing with v0.58 and our 5G USB T-Mobile modem, ModemManager did not work properly. Subsequently we tested various configurations to get things to work (QMI, MBIM and NCM). I can't remember the configuration specifics, but as we tested the various configurations and succeeded in establishing a connection, we had a similar declining throughput experience with omr-test-speed. Our solution was to move on to v0.59beta6 with a newer ModemManager that established a working connection with our 5G hardware.

Network-Traditions avatar Sep 18 '22 19:09 Network-Traditions

I'm guessing you've already rebooted your VPS, but we have also noticed times when our speedtests drop to 5Gbps down and 1Gbps up, and a reboot of the VPS restores them to about 100Gbps down and about 10Gbps up. This usually occurs when we've been changing the TCP configuration (scheduler, path manager, SYN retries, etc). From time to time (a few times a month), we also have to power cycle OMR to reset the 5G USB T-Mobile modem, which goes into a state of endless connecting for a few seconds before disconnecting and repeating the cycle.

We do have good news coming on our new build in regard to SQM autorate. We're tentatively describing it as "taming wireless latency with SQM autorate": stabilizing the connection to get better aggregation performance, where we are seeing speeds greater than the best single WAN connection. We think this has been occurring before, but the unstable latency limited its duration and cluttered the results, making it difficult to isolate the aggregation.

Network-Traditions avatar Sep 19 '22 22:09 Network-Traditions

I know this makes no sense but I have seen it multiple times. Place a dumb switch in between starlink and the router.

kb1isz avatar Sep 23 '22 23:09 kb1isz

For our Starlink v2 with the Ethernet adapter, we are using "bypass mode" so Starlink supplies a bridged DHCP IP address to the connected OMR Ethernet interface. We've had a similar experience with the wwan ModemManager interface for our USB 3.0 5G modem as well. At times, we have found these configurations benefit from having "Force link" checked in the Network-Interfaces-Advanced Settings tab of the aforementioned OMR interfaces. Unchecked, 5G and/or Starlink will sometimes end up in a connect/disconnect loop for reasons unknown at this time. Adding a switch in between Starlink and OMR would likely resolve this issue, add an extra hop and provide additional traffic buffering potentially delivering improved connection stability while adding additional complexity and some latency. While we don't believe the specific problem of this issue would be resolved by SQM autorate, our testing to date has been able to significantly stabilize the aggregation of a 5G connection with Starlink as well as Starlink by itself. Currently we are sacrificing bandwidth to achieve stability, but we expect to increase this efficiency as we learn more about the advanced configuration options of autorate and gather more test results.

Network-Traditions avatar Sep 28 '22 18:09 Network-Traditions

I know this makes no sense but I have seen it multiple times. Place a dumb switch in between starlink and the router.

Are you recommending this because of the USB adapters? I now have a dedicated computer with a 4-port NIC card; each NIC has its own processor, so there is no shared bus bandwidth. Do you still think a simple unmanaged switch would help in this case?

ioogithub avatar Sep 29 '22 21:09 ioogithub

While we don't believe the specific problem of this issue would be resolved by SQM autorate, our testing to date has been able to significantly stabilize the aggregation of a 5G connection with Starlink as well as Starlink by itself.

What did you set the SQM autorate values to for the Starlink?

Adding a switch in between Starlink and OMR would likely resolve this issue, add an extra hop and provide additional traffic buffering potentially delivering improved connection stability while adding additional complexity and some latency.

Do you currently have a switch between your starlink and your OMR router, is this the OMR router with the 4 NIC ports? Are you suggesting a simple unmanaged switch to somehow act as a buffer?

ioogithub avatar Sep 29 '22 21:09 ioogithub

Our OMR WAN interface is directly connected to the Starlink v2 Ethernet adapter, no switching equipment in between. We are back to using our HUNSN device with the i225 NICs with good results. As far as SQM autorate, we still have a great deal of testing to complete as we are digging into the details of how it functions and how we may wish to apply more advanced tweaks to its configuration.

Bearing in mind that our conclusions to date may change, this is where we are at the moment:

  1. Use the OpenMPTCProuter wizard to set up SQM initially. This requires selecting "Enable SQM" and "Enable SQM autorate". While each can be selected independently, reviewing the resulting configuration in "Network-SQM Qos", the language there seems to clarify that "Enable SQM" is an on/off toggle for all SQM, including autorate.
  2. Input desired Download and Upload speeds in the wizard. The wizard sets up the autorate configuration based on these inputs, however, we've been modifying them afterward in "Network-SQM Qos".
  3. After the wizard completes, adjust the settings as desired in "Network-SQM Qos". We've included this procedure in the event the wizard is doing more than just setting the obvious values. Maybe @Ysurac could comment on what the wizard does beyond the settings found in "Network-SQM Qos".

From there, lots of testing to include but not be limited to, speedtest.net, bufferbloat, fast.com, actual live system activities (high bandwidth youtube, 4K streaming, VOIP latency sensitive phone calls, Remote Desktop, Web surfing, large uploads and downloads, VPN traffic, etc). Synthetic testing is helpful to provide guidance, but real world experience often differs tremendously. Since we have pfSense as our local gateway router, we use its connection feedback to evaluate latency issues during testing. We have a number of OpenVPN client connections to customer pfSense points of presence and have found they are very sensitive to OMR's efficiency.

We have a long way to go on this path, and currently we are leaving a fair amount of bandwidth on the table to ensure quality latency and reduce connection warnings about the VPS and the connected WAN services. So far we have demonstrated we can achieve a satisfactory level of connection quality. Now we wish to see how much bandwidth we can squeeze out of our services. The dramatically varying performance of Starlink and our T-Mobile connection complicates this process, since our signal weakness causes the service to switch "Ultra Capacity" (UC) on and off unpredictably. As a result, our Starlink has speeds ranging from 30-150Mbps download with 5-30Mbps upload, and T-Mobile ranges from 10-500Mbps download with 2.5-50Mbps upload.

Given the aforementioned service measurements, we currently have T-Mobile SQM autorate set at 30Mbps base download, 10Mbps minimum download, 50Mbps maximum download, 10Mbps base upload, 2.5Mbps minimum upload and 20Mbps maximum upload. Starlink is set at 50Mbps base download, 30Mbps minimum download, 150Mbps maximum download, 10Mbps base upload, 2.5Mbps minimum upload and 20Mbps maximum upload. We left the "Queue Discipline" at cake with piece_of_cake.qos. The OpenMPTCProuter wizard sets the "Link Layer Adaptation" to ATM with a "Per Packet Overhead (byte)" of 40. We're changing this to Ethernet with 40 because we don't use any DSL services, which the documentation suggests the ATM selection is intended for. This is still a wildcard, as we will be attempting to determine the exact overhead of our connections and set things accordingly. Available information suggests that while these settings are not deal breakers, they will impact the performance of autorate, especially with smaller VOIP packets. The only change we've implemented on the "Autorate settings" tabs is unchecking "Sleep functionality". It's hard to say whether this helps; it feels as though it does, but our thinking is that we don't want autorate "sleeping" since our connection is significantly active 24/7/365.
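The autorate values described above can be written down as data and sanity-checked before they are typed into "Network-SQM Qos". This is only a sketch: the `AutorateLimits` class and its field names are invented for illustration and are not the option names used by OpenMPTCProuter or sqm-scripts; the numbers are the ones quoted in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutorateLimits:
    """Hypothetical container for one direction of an autorate setup (Mbps)."""
    min_mbps: float
    base_mbps: float
    max_mbps: float

    def is_consistent(self) -> bool:
        # The base rate should sit between the minimum and maximum rates.
        return 0 < self.min_mbps <= self.base_mbps <= self.max_mbps


settings = {
    "T-Mobile": {
        "download": AutorateLimits(min_mbps=10, base_mbps=30, max_mbps=50),
        "upload": AutorateLimits(min_mbps=2.5, base_mbps=10, max_mbps=20),
    },
    "Starlink": {
        "download": AutorateLimits(min_mbps=30, base_mbps=50, max_mbps=150),
        "upload": AutorateLimits(min_mbps=2.5, base_mbps=10, max_mbps=20),
    },
}

for wan, directions in settings.items():
    for direction, limits in directions.items():
        status = "ok" if limits.is_consistent() else "check values"
        print(f"{wan} {direction}: {limits.min_mbps}/{limits.base_mbps}/{limits.max_mbps} Mbps ({status})")
```

Keeping the values in one place like this also makes it easier to record which combination was active during each test run.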

We've learned the following about autorate:

  1. It apparently learns about its controlled connection over time
  2. When it initiates a connection it starts at the minimum rates and ramps up according to its algorithm and the settings.
  3. There are circumstances where it will attempt bandwidth performance beyond the settings, though we are unsure of the details of how this is implemented.

Key issues we have yet to establish:

  1. What procedure best clears out prior traffic, cache and other factors that would improperly influence a new configuration and subsequent testing? (We've gone as far as restarting all the devices of the connection.)
  2. How long should a configuration be implemented before drawing conclusions about the results?
  3. What do the "Autorate settings" influence, and what changes should we consider for our respective services?

Our results to date may be irrelevant due to switching our "Default Proxy" to Shadowsocks as a result of our issues with #2583. We will be interested to see what you find as well.

Network-Traditions avatar Sep 29 '22 22:09 Network-Traditions

  1. While each can be selected independently,

I was just testing this with trial and error and it was confusing me as well. Right now I have "Enable SQM autorate" enabled but "Enable this SQM instance" is not checked, so does this mean that SQM is off but SQM autorate is on, or does it mean they are both off? Thanks for alerting me to this "Network-SQM Qos" section; I will look more closely at it from now on. It seems there is some UI issue here if one should be connected to the other.

Since I have moved to this new platform which eliminates the weak RPI CPU, the USB adapters with shared bus bandwidth and potential power issues, I am still seeing the same bandwidth bottlenecks as the old platform. I have also tried two different VPS as well.

I have one rock solid 4G connection and one starlink. If I omr-test-speed each wan separately I get an average of 25Mb/s on the 4G and around 40Mb/s on the starlink.

When I bond 4G and Starlink I end up with less than 18Mb/s in total, so slower than the slowest (4G) connection alone. OMR actually destabilizes the 4G connection to around 15Mb/s, and the Starlink ends up at 3Mb/s.

So far I can't find any combination of settings that seems to fix this and SQM doesn't seem to be helping.

The conclusion is unfortunately always an aggregated bandwidth that is even slower than the slower (4G) connection.
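To put numbers on this, here is a small worked calculation using only the figures reported above (25 and 40 Mb/s alone, about 15 and 3 Mb/s when bonded). The function name `aggregation_report` is hypothetical, and the ideal total assumes perfect aggregation, which real MPTCP setups do not reach.

```python
def aggregation_report(solo_mbps, bonded_mbps):
    """Compare per-link bonded throughput against each link measured alone."""
    ideal_total = sum(solo_mbps.values())
    bonded_total = sum(bonded_mbps.values())
    return {
        "ideal_total": ideal_total,
        "bonded_total": bonded_total,
        "efficiency": bonded_total / ideal_total,
        "per_link": {name: bonded_mbps[name] / solo_mbps[name] for name in solo_mbps},
    }


solo = {"4G": 25.0, "Starlink": 40.0}    # each WAN tested separately
bonded = {"4G": 15.0, "Starlink": 3.0}   # approximate per-WAN rates when bonded

report = aggregation_report(solo, bonded)
print(f"ideal {report['ideal_total']:.0f} Mb/s, bonded {report['bonded_total']:.0f} Mb/s")
print(f"overall efficiency: {report['efficiency']:.1%}")
for name, share in report["per_link"].items():
    print(f"{name}: {share:.1%} of its standalone rate")
```

With these figures the bonded total is 18 Mb/s against an ideal of 65 Mb/s, and the Starlink link carries under a tenth of what it manages alone, which suggests the Starlink subflow is the one being throttled.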

ioogithub avatar Sep 29 '22 23:09 ioogithub

So right now I have "Enable SQM autorate" enabled but "Enable this SQM instance" is not checked, so does this mean the SQM is off but SQM autorate is on or does it mean they are both off?

It's my opinion, based on the language of "Network-SQM Qos-Basic Settings", that "Enable this SQM instance." implies that unless this box is checked, everything SQM for the identified "Interface name" is essentially off.

When checking Enable SQM in the wizard, you will find "Enable this SQM instance." checked and likewise if you change the setting in "Network-SQM Qos-Basic Settings", the change will be reflected on the wizard page so they seem directly linked. I trial and error tested the same concept and believe that the "Enable" SQM must be selected before enabling autorate will have any impact.

As I've become more experienced with our OMR deployment, I have begun to stabilize the RTT latency for both the Starlink and T-Mobile connections, as well as the deviation between them. I am now confident we are achieving significant bandwidth aggregation, with download and upload speeds that exceed those of any single service. I continue to refine methods and tools to build better and more meaningful results and will be posting what I learn along the way.

Network-Traditions avatar Sep 30 '22 01:09 Network-Traditions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Dec 29 '22 19:12 github-actions[bot]

I have a similar issue, but it happens spontaneously after I reconnect the cable or reboot the dishy. Sometimes my router reports the link speed as 10Mbps, the client does not receive an IP from Starlink's DHCP, and the debug tab in the Starlink app shows an error saying "Ethernet is slow". Only switching the network port off and on or reconnecting the cable helps...

adutchak avatar Jan 04 '23 16:01 adutchak

@adutchak check your Ethernet cable between Starlink and your router, and check with ifconfig via ssh on OpenMPTCProuter whether there are any errors on the interface.
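The same error counters that ifconfig prints are also exposed by the Linux kernel under `/sys/class/net/<interface>/statistics/`. The sketch below reads them directly; it assumes a Linux host with Python available (which may not be the case on a stock OMR image), and `read_error_counters` is a hypothetical helper written for this illustration.

```python
from pathlib import Path

# Counter files provided by the kernel for every network interface.
ERROR_COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_error_counters(interface, sysfs_root="/sys/class/net"):
    """Return the error/drop counters for an interface, skipping missing files."""
    stats_dir = Path(sysfs_root) / interface / "statistics"
    counters = {}
    for name in ERROR_COUNTERS:
        counter_file = stats_dir / name
        if counter_file.is_file():
            counters[name] = int(counter_file.read_text().strip())
    return counters


# Example: eth1 is the device behind wan1 in the log quoted earlier in this issue.
print(read_error_counters("eth1"))
```

Counters that keep rising between two reads a few minutes apart would point to a cabling or link negotiation problem rather than to OMR itself.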

Ysurac avatar Jan 04 '23 17:01 Ysurac

@adutchak degraded speed aside, as far as Starlink DHCP goes you may find this link helpful: https://nelsonslog.wordpress.com/2021/04/07/openwrt-vs-starlink-dhcp-leases. Combined with the details revealed in #2584, you should have better success with Starlink DHCP assignment, at least as it relates to Starlink v2, which is the equipment I'm using. To @Ysurac's comment, it does seem likely that Ethernet link negotiation is not succeeding and therefore there is no DHCP assignment, so double checking the connectors, cable and Ethernet configuration from Dishy to your router is the best place to start. If you have a laptop, plug it directly into your Dishy to validate Dishy, its Ethernet connector and the cable being used, then move on to the OMR side of the equation.

Network-Traditions avatar Jan 04 '23 20:01 Network-Traditions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Apr 05 '23 19:04 github-actions[bot]