core icon indicating copy to clipboard operation
core copied to clipboard

PPP connection failures after updating from 24.7.12_4 to 25.1.12 reconnect fails 10+s ping

Open raldone01 opened this issue 4 months ago • 20 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

After some time 0-2 days of running perfectly, the ppp link drops and fails to reconnect. Sometimes it fails so badly, that the ping to opnsense becomes 10+s and the firewall webgui becomes unusable. After the ppp ethernet cable is unplugged the ping recovered and the firewall became responsive again.

Here is an excerpt of the pinging during the issue. It recovered once the cable was unplugged.

64 bytes from 172.24.0.1: icmp_seq=537 ttl=64 time=8417 ms
64 bytes from 172.24.0.1: icmp_seq=538 ttl=64 time=8487 ms
64 bytes from 172.24.0.1: icmp_seq=539 ttl=64 time=8620 ms
64 bytes from 172.24.0.1: icmp_seq=540 ttl=64 time=8551 ms
64 bytes from 172.24.0.1: icmp_seq=541 ttl=64 time=8589 ms
64 bytes from 172.24.0.1: icmp_seq=542 ttl=64 time=8694 ms
64 bytes from 172.24.0.1: icmp_seq=543 ttl=64 time=8666 ms
64 bytes from 172.24.0.1: icmp_seq=544 ttl=64 time=8842 ms
64 bytes from 172.24.0.1: icmp_seq=546 ttl=64 time=10820 ms
64 bytes from 172.24.0.1: icmp_seq=547 ttl=64 time=10752 ms
64 bytes from 172.24.0.1: icmp_seq=548 ttl=64 time=10856 ms
64 bytes from 172.24.0.1: icmp_seq=549 ttl=64 time=10919 ms
64 bytes from 172.24.0.1: icmp_seq=550 ttl=64 time=10864 ms
64 bytes from 172.24.0.1: icmp_seq=551 ttl=64 time=11099 ms
64 bytes from 172.24.0.1: icmp_seq=552 ttl=64 time=10960 ms
64 bytes from 172.24.0.1: icmp_seq=553 ttl=64 time=10887 ms
64 bytes from 172.24.0.1: icmp_seq=557 ttl=64 time=10538 ms
64 bytes from 172.24.0.1: icmp_seq=558 ttl=64 time=10649 ms
64 bytes from 172.24.0.1: icmp_seq=559 ttl=64 time=10357 ms
64 bytes from 172.24.0.1: icmp_seq=560 ttl=64 time=10000 ms
64 bytes from 172.24.0.1: icmp_seq=561 ttl=64 time=9715 ms
64 bytes from 172.24.0.1: icmp_seq=562 ttl=64 time=9245 ms
64 bytes from 172.24.0.1: icmp_seq=563 ttl=64 time=9084 ms
64 bytes from 172.24.0.1: icmp_seq=564 ttl=64 time=8828 ms
64 bytes from 172.24.0.1: icmp_seq=567 ttl=64 time=8617 ms
64 bytes from 172.24.0.1: icmp_seq=568 ttl=64 time=8390 ms
64 bytes from 172.24.0.1: icmp_seq=569 ttl=64 time=8407 ms
64 bytes from 172.24.0.1: icmp_seq=570 ttl=64 time=8370 ms
64 bytes from 172.24.0.1: icmp_seq=571 ttl=64 time=8537 ms
64 bytes from 172.24.0.1: icmp_seq=572 ttl=64 time=8670 ms
64 bytes from 172.24.0.1: icmp_seq=573 ttl=64 time=8736 ms
64 bytes from 172.24.0.1: icmp_seq=574 ttl=64 time=8300 ms
64 bytes from 172.24.0.1: icmp_seq=575 ttl=64 time=7626 ms
64 bytes from 172.24.0.1: icmp_seq=576 ttl=64 time=10331 ms
64 bytes from 172.24.0.1: icmp_seq=577 ttl=64 time=10264 ms
64 bytes from 172.24.0.1: icmp_seq=578 ttl=64 time=10239 ms
64 bytes from 172.24.0.1: icmp_seq=579 ttl=64 time=10076 ms
64 bytes from 172.24.0.1: icmp_seq=580 ttl=64 time=9801 ms
64 bytes from 172.24.0.1: icmp_seq=581 ttl=64 time=9620 ms
64 bytes from 172.24.0.1: icmp_seq=582 ttl=64 time=9505 ms
64 bytes from 172.24.0.1: icmp_seq=583 ttl=64 time=9841 ms
64 bytes from 172.24.0.1: icmp_seq=586 ttl=64 time=10181 ms
64 bytes from 172.24.0.1: icmp_seq=587 ttl=64 time=10037 ms
64 bytes from 172.24.0.1: icmp_seq=588 ttl=64 time=9881 ms
64 bytes from 172.24.0.1: icmp_seq=589 ttl=64 time=9794 ms
64 bytes from 172.24.0.1: icmp_seq=590 ttl=64 time=9970 ms
64 bytes from 172.24.0.1: icmp_seq=591 ttl=64 time=9867 ms
64 bytes from 172.24.0.1: icmp_seq=592 ttl=64 time=9627 ms
64 bytes from 172.24.0.1: icmp_seq=593 ttl=64 time=9624 ms
64 bytes from 172.24.0.1: icmp_seq=594 ttl=64 time=8843 ms
64 bytes from 172.24.0.1: icmp_seq=595 ttl=64 time=8031 ms
64 bytes from 172.24.0.1: icmp_seq=596 ttl=64 time=9001 ms
64 bytes from 172.24.0.1: icmp_seq=597 ttl=64 time=8958 ms
64 bytes from 172.24.0.1: icmp_seq=598 ttl=64 time=8185 ms
64 bytes from 172.24.0.1: icmp_seq=599 ttl=64 time=7357 ms
64 bytes from 172.24.0.1: icmp_seq=600 ttl=64 time=6643 ms
64 bytes from 172.24.0.1: icmp_seq=601 ttl=64 time=5827 ms
64 bytes from 172.24.0.1: icmp_seq=602 ttl=64 time=4975 ms
64 bytes from 172.24.0.1: icmp_seq=603 ttl=64 time=4125 ms
64 bytes from 172.24.0.1: icmp_seq=604 ttl=64 time=3349 ms
64 bytes from 172.24.0.1: icmp_seq=605 ttl=64 time=2611 ms
64 bytes from 172.24.0.1: icmp_seq=606 ttl=64 time=1730 ms
64 bytes from 172.24.0.1: icmp_seq=607 ttl=64 time=855 ms
64 bytes from 172.24.0.1: icmp_seq=608 ttl=64 time=147 ms
64 bytes from 172.24.0.1: icmp_seq=609 ttl=64 time=2.90 ms
64 bytes from 172.24.0.1: icmp_seq=610 ttl=64 time=0.158 ms
64 bytes from 172.24.0.1: icmp_seq=611 ttl=64 time=26.9 ms
64 bytes from 172.24.0.1: icmp_seq=612 ttl=64 time=2.19 ms
64 bytes from 172.24.0.1: icmp_seq=613 ttl=64 time=14.9 ms
64 bytes from 172.24.0.1: icmp_seq=614 ttl=64 time=180 ms
64 bytes from 172.24.0.1: icmp_seq=615 ttl=64 time=295 ms
64 bytes from 172.24.0.1: icmp_seq=616 ttl=64 time=345 ms
64 bytes from 172.24.0.1: icmp_seq=617 ttl=64 time=288 ms
64 bytes from 172.24.0.1: icmp_seq=618 ttl=64 time=344 ms
64 bytes from 172.24.0.1: icmp_seq=619 ttl=64 time=66.5 ms
64 bytes from 172.24.0.1: icmp_seq=620 ttl=64 time=0.158 ms
64 bytes from 172.24.0.1: icmp_seq=621 ttl=64 time=0.180 ms
64 bytes from 172.24.0.1: icmp_seq=622 ttl=64 time=11.5 ms
64 bytes from 172.24.0.1: icmp_seq=623 ttl=64 time=5.46 ms
64 bytes from 172.24.0.1: icmp_seq=624 ttl=64 time=12.0 ms
64 bytes from 172.24.0.1: icmp_seq=625 ttl=64 time=0.239 ms
64 bytes from 172.24.0.1: icmp_seq=626 ttl=64 time=0.233 ms
64 bytes from 172.24.0.1: icmp_seq=627 ttl=64 time=0.247 ms
64 bytes from 172.24.0.1: icmp_seq=628 ttl=64 time=3.85 ms
64 bytes from 172.24.0.1: icmp_seq=629 ttl=64 time=0.185 ms
64 bytes from 172.24.0.1: icmp_seq=630 ttl=64 time=0.456 ms
64 bytes from 172.24.0.1: icmp_seq=631 ttl=64 time=0.138 ms
64 bytes from 172.24.0.1: icmp_seq=632 ttl=64 time=111 ms
64 bytes from 172.24.0.1: icmp_seq=633 ttl=64 time=21.7 ms
64 bytes from 172.24.0.1: icmp_seq=634 ttl=64 time=17.9 ms
64 bytes from 172.24.0.1: icmp_seq=635 ttl=64 time=19.7 ms
64 bytes from 172.24.0.1: icmp_seq=636 ttl=64 time=23.3 ms
64 bytes from 172.24.0.1: icmp_seq=637 ttl=64 time=0.227 ms
64 bytes from 172.24.0.1: icmp_seq=638 ttl=64 time=0.304 ms
64 bytes from 172.24.0.1: icmp_seq=639 ttl=64 time=0.209 ms

I also downloaded the logs before rebooting.

To Reproduce

Have a ppp link to the ISP and let it run for 0-2 days. I am sure more information is necessary and I am happy to provide it.

Expected behavior

The ppp link should remain stable and not cause any issues.

Describe alternatives you considered

Downgrading. Which I will do soon.

Relevant log files

Full week log. Full week log google drive mirror.

Only the last boot should be relevant but I have included everything in the log. The queues blow up after the wan connection dies which is expected.

Additional context

This problem has happened on 25.7.1 for the first time. It also happens on 25.7.2.

Environment

OPNsense 25.7.2-amd64.

raldone01 avatar Aug 30 '25 07:08 raldone01

Did you end up downgrading? I am asking because this is a 7 days ticket.

sopex avatar Aug 30 '25 11:08 sopex

I did not downgrade yet. But it crashed again last night. My parents are pressuring me quite hard now. I did some config changes and I don't know what the best way of porting them would be or if opnsense configs are backwards compatible. I also have a backup of the old config before the upgrade. Maybe I will just diff the xml or something.

I can push out the downgrade though if you need me to gather some data.

Inconsistent slow issues are just the worst. I taught my mum to replug the opnsense which is less than ideal.

raldone01 avatar Aug 30 '25 11:08 raldone01

It crashed again. I have now downgraded to 24.7. I will comment again in a week to confirm that it is a regression.

raldone01 avatar Sep 01 '25 16:09 raldone01

It has been a week with no crashes and stable internet on OPNsense 24.7-amd64.

@sopex if you want I can upgrade once more to verify the issue happens upon updating. Or how should I proceed?

raldone01 avatar Sep 07 '25 10:09 raldone01

It has been a week with no crashes and stable internet on OPNsense 24.7-amd64.

@sopex if you want I can upgrade once more to verify the issue happens upon updating. Or how should I proceed?

If you want to create a snapshot to downgrade more easily and do it... Go ahead, sure.

sopex avatar Sep 07 '25 11:09 sopex

Okay I noticed that I first have to upgrade to 24.7.12. I will give it three days with that version before snapshotting and upgrading to 25.

raldone01 avatar Sep 07 '25 12:09 raldone01

I just noticed 24.7. Why did you go back there? Was the problem also present on 25.1?

sopex avatar Sep 07 '25 15:09 sopex

I am not 100% sure. I will update gradually if that's possible to make sure I notice the exact version upgrade that caused it.

raldone01 avatar Sep 07 '25 21:09 raldone01

24.7.12_4 to 25.1.12 causes the issue 100%. I have verified it on a fresh install.

opnsense_25.1.12_redacted.log

raldone01 avatar Sep 19 '25 06:09 raldone01

#9222 seems similar -- I'm not claiming there is no issue, but it's rather elusive without further triage by people running into the issue.

fichtner avatar Sep 19 '25 07:09 fichtner

I will remain on 24.7 for now. If you need me to collect some logs or run experiments feel free to tell me. Snapshotting makes experimentation and rollback easy.

raldone01 avatar Sep 19 '25 07:09 raldone01

Also my interfaces are called igc not bge and I also have an issue.

raldone01 avatar Sep 19 '25 07:09 raldone01

If you can converge on a common first version that broke this that would be helpful. At the moment it's mostly assumptions. I see a lot of "Network down" error in your logs which typically doesn't have a cause in scripting and config file management that we do in "core".

fichtner avatar Sep 19 '25 07:09 fichtner

You mean: Warning radvd sendmsg: Network is down? I think this is just caused by my emergency lan where no device is connected. Radvd is ipv6 router advertisement stuff right? Pity it doesn't say what network is down.

raldone01 avatar Sep 19 '25 07:09 raldone01

Depending on the configuration this may cause spurious reloads of IPv6 which could also cycle the WAN connection, but this isn't easily debuggable in community scope.

fichtner avatar Sep 19 '25 07:09 fichtner

Depending on roughly what configuration? While not ideal, cycling WAN should not cause any issues though right?

raldone01 avatar Sep 19 '25 07:09 raldone01

I performed a clean install of v25.1 and then performed incremental upgrades to determine v25.1.3 as the first version to have the PPPoE issue, see #9222

It would be interesting to understand if it is the same in your case.

Summit48 avatar Sep 21 '25 13:09 Summit48

I agree. I will take some time to do it though. Is there an easy way to go from my 24.7.12_4 and update incrementally by cli or gui?

raldone01 avatar Sep 21 '25 14:09 raldone01

See these instructions provided by @fichtner

https://github.com/opnsense/core/issues/9222#issuecomment-3312145245

Summit48 avatar Sep 21 '25 22:09 Summit48

I will track my progress here.

24.7.12_4 34/09/2025 -> 25.1.1 3/10/2025 -> 25.1.2 17.10.2025 -> 25.1.3 10.11.2025 -> 25.1.4 6/12/2025 -> 25.1.5 (Testing)

raldone01 avatar Sep 24 '25 10:09 raldone01