klipper icon indicating copy to clipboard operation
klipper copied to clipboard

Improve Clock Synchronization in Multi-MCU Environments

Open mfeldheim opened this issue 1 year ago • 19 comments

Summary

This PR addresses issues with timing drift and synchronization stability in the clocksync.py module. The changes improve clock alignment between the main MCU and secondary MCUs, ensuring robust operation in multi-MCU environments.


Key Changes

  1. Improved Event Scheduling in _get_clock_event:

    • Added an explicit call to update_timer to ensure the next clock synchronization event is properly scheduled.
    • Updated Code:
      next_eventtime = eventtime + .9839
      self.reactor.update_timer(self.get_clock_timer, next_eventtime)
      return next_eventtime
      
    • Rationale: Prevents misalignments in periodic events, especially when overlapping with other timers.
  2. Refined Query Tracking in _handle_clock:

    • Modified self.queries_pending logic to decrement instead of resetting to zero:
      self.queries_pending = max(0, self.queries_pending - 1)
      
    • Rationale: Accurately tracks pending queries, even when multiple queries overlap or complete asynchronously.
  3. Enhanced calibrate_clock for Synchronization:

    • Streamlined synchronization point calculations:
      sync1_print_time = max(print_time, est_print_time)
      time_diff = print_time - est_print_time
      sync2_print_time = max(sync1_print_time + 4., self.last_sync_time, print_time + 2.5 * time_diff)
      
    • Adjusted the calculation of frequency (adjusted_freq) and offset (adjusted_offset):
      time_span = sync2_print_time - sync1_print_time
      adjusted_freq = (sync2_clock - sync1_clock) / time_span
      adjusted_offset = sync1_print_time - (sync1_clock / adjusted_freq)
      
    • Rationale: Improves synchronization accuracy between the main MCU and secondary MCUs.

Testing

  • Tested in a physical triple-MCU environment under various workloads and conditions:
    • Verified synchronization accuracy and timing stability.
    • Stress-tested with high-frequency events under artificial load.

Impact

  • Enhanced synchronization accuracy reduces timing drift across MCUs.
  • Improved handling of queries ensures robust and reliable operation in multi-MCU setups.

This PR enhances the stability and accuracy of clock synchronization for multi-MCU environments, addressing previous issues with drift and event scheduling. Feedback and additional testing are welcome!

mfeldheim avatar Nov 30 '24 20:11 mfeldheim

if M106 is frequently called or G2/G3 is used to move at a higher speed in the slicing file, can this request better adapt to such pressure testing?

Yangyuanxin avatar Dec 01 '24 16:12 Yangyuanxin

if M106 is frequently called or G2/G3 is used to move at a higher speed in the slicing file, can this request better adapt to such pressure testing?

Do you have a specific test in mind?

mfeldheim avatar Dec 01 '24 17:12 mfeldheim

Thank you for submitting a PR, please be aware that you need to sign off to accept the developer certificate of origin, please see point 3 in https://github.com/Klipper3d/klipper/blob/master/docs/CONTRIBUTING.md#what-to-expect-in-a-review

Thanks James

JamesH1978 avatar Dec 02 '24 08:12 JamesH1978

; Start G-Code Start G90 M104 S185 M109 S185 G28 M117 G1 Z5 F9600 G1 X130 Y127 F30000 ; Start G-Code End ; Execute XX times in a loop, you can decide on your own M106 S255 M106 S128 G1 F30000 G1 X125 Y125 Z20 M106 S0 G2 X124.5 Y124.5 Z19 E1 I10.5 J10.5 M106 S255 M106 S128 G2 X125 Y125 Z20 E1 I10.5 J10.5 G2 X124.5 Y124.5 Z20 E1 I10.5 J10.5 M106 S0 G2 X125 Y125 I0 J10 G2 X124.5 Y124.5 J10 M106 S255 M106 S128 G2 X125 Y125 I10 J0 G2 X124.5 Y124.5 I10 M106 S0 G1 X125 Y125 Z19 M106 S255 M106 S128 G2 X124.5 Y124.5 Z20 E1 I10.5 J10.5 M106 S0 G17 ; Execute XX times in a loop, you can decide on your own

Yangyuanxin avatar Dec 02 '24 12:12 Yangyuanxin

@Yangyuanxin, I’ve successfully run 100 loops on my hardware using the branch without any issues. My setup has the X/Y steppers on the same MCU, I assume arc commands would highlight any synchronization problems if X and Y were running on separate MCUs. Feel free to test it on your hardware. I’ve temporarily moved the Pull Request to draft status to comply with Klipper PR requirements and to write unit tests

mfeldheim avatar Dec 02 '24 14:12 mfeldheim

Found this while researching various disconnect issues for a KP3S Pro V2 (Connection timeout on homing) as well as Lost Communication errors when using the CoPrint Multi-Color system.

CoPrint is yet untested (as I am not experiencing the issue such as some community members are) but on the KP3S Pro V2 I can report very Positive results.

K3PS Pro V2 uses a MKS Pi Clone board for the mainboard and a RP2040 board for the printhead. RP2040 is over CANBUS.

Before replacing the file: Using (at time of writing) mainline klipper, 80% of the time I am unable to complete a homing action, even if the printhead is less than 2 in from the bed surface. Once Z motion starts and klipper is now looking for a signal from the rp2040 board instead of a sensor on the main board, a timeout occurs almost instantly.

After replacing the file: I have been homing and re-homing the printer for the better part of 30 minutes and I still have been unable to produce the Connection timeout on homing error. Printing has yet to be tested.

This FIXES the Connection timeout on homing error, absolutely. Further action and testing is needed but I believe from my current testing that THIS RIGHT HERE is the TRUE FIX to the cop-out repair job that is the TRSYNC_TIMEOUT fix.

Provided this holds true for others as well and there are no knock-on effects from this change I believe this should be merged to mainline with all haste!!

Update: Well, fun while it lasted. After sitting for 3 hours then coming back and trying again, Did get the timeout error again on homing so not a perfect fix. Will continue use testing since an improvement was visible for a time.

lanpartyceo avatar Dec 02 '24 21:12 lanpartyceo

I have a system plagued by "MCU shutdown timer too close" issues when using three MCU's. I can guarantee a timer too close situation 90% of the time when performing certain functions. It's perfectly fine when using only two MCU's. I want to try your PR, but I have never done this before. How can I install and test this the most scientific way to see if it helps? EDIT: So I replaced my clocksync.py with your clocksync.py and was immediately able to reproduce my issue, so no change. Can I help in any way?

53Aries avatar Dec 03 '24 04:12 53Aries

I have a system plagued by "MCU shutdown timer too close" issues when using three MCU's. I can guarantee a timer too close situation 90% of the time when performing certain functions. It's perfectly fine when using only two MCU's. I want to try your PR, but I have never done this before. How can I install and test this the most scientific way to see if it helps? EDIT: So I replaced my clocksync.py with your clocksync.py and was immediately able to reproduce my issue, so no change. Can I help in any way?

Thanks for testing. What are your MCUs and how are they communicating? It would be interesting to see your klippy.log, could you attach it as a file? :paperclip:

mfeldheim avatar Dec 04 '24 09:12 mfeldheim

Found this while researching various disconnect issues for a KP3S Pro V2 (Connection timeout on homing) as well as Lost Communication errors when using the CoPrint Multi-Color system.

CoPrint is yet untested (as I am not experiencing the issue such as some community members are) but on the KP3S Pro V2 I can report very Positive results.

K3PS Pro V2 uses a MKS Pi Clone board for the mainboard and a RP2040 board for the printhead. RP2040 is over CANBUS.

Before replacing the file: Using (at time of writing) mainline klipper, 80% of the time I am unable to complete a homing action, even if the printhead is less than 2 in from the bed surface. Once Z motion starts and klipper is now looking for a signal from the rp2040 board instead of a sensor on the main board, a timeout occurs almost instantly.

After replacing the file: I have been homing and re-homing the printer for the better part of 30 minutes and I still have been unable to produce the Connection timeout on homing error. Printing has yet to be tested.

This FIXES the Connection timeout on homing error, absolutely. Further action and testing is needed but I believe from my current testing that THIS RIGHT HERE is the TRUE FIX to the cop-out repair job that is the TRSYNC_TIMEOUT fix.

Provided this holds true for others as well and there are no knock-on effects from this change I believe this should be merged to mainline with all haste!!

Update: Well, fun while it lasted. After sitting for 3 hours then coming back and trying again, Did get the timeout error again on homing so not a perfect fix. Will continue use testing since an improvement was visible for a time.

Thanks for testing, sorry to hear you're still having issues. If you don't mind, attach your klippy.log as a file

mfeldheim avatar Dec 04 '24 09:12 mfeldheim

I have a system plagued by "MCU shutdown timer too close" issues when using three MCU's. I can guarantee a timer too close situation 90% of the time when performing certain functions. It's perfectly fine when using only two MCU's. I want to try your PR, but I have never done this before. How can I install and test this the most scientific way to see if it helps? EDIT: So I replaced my clocksync.py with your clocksync.py and was immediately able to reproduce my issue, so no change. Can I help in any way?

Thanks for testing. What are your MCUs and how are they communicating? It would be interesting to see your klippy.log, could you attach it as a file? 📎

This is a qidi q1 "rooted" to armbian bookworm and using up to date klipper 12. It uses a host/ mainboard combo from MKS based on the MKS skipr. Host is a cortex a53 1.5ghz. Main mcu is stm32f407. The toolhead is a rp2040 connected via usb to the mainboard. I have no issues with this part. When I add in another usb mcu, stm32h723, for Boxturtle AFC I will get timer too close shutdowns on filament homing moves typically right as the system starts looking for the status of my toolhead filament sensor(located on the rp2040 toolhead board). I have as yet not ruled out a usb cable issue however the system does work when using Happy Hare software which is more macro based vs AFC which is more python based. klippy (64).log

53Aries avatar Dec 04 '24 15:12 53Aries

Found this while researching various disconnect issues "" "" Thanks for testing, sorry to hear you're still having issues. If you don't mind, attach your klippy.log as a file

Here is a log file of the error occuring.. After a log rollover, homing was attempted and the error occurred.

klippy_after error.log

lanpartyceo avatar Dec 04 '24 17:12 lanpartyceo

This is a qidi q1 "rooted" to armbian bookworm and using up to date klipper 12. It uses a host/ mainboard combo from MKS based on the MKS skipr. Host is a cortex a53 1.5ghz. Main mcu is stm32f407. The toolhead is a rp2040 connected via usb to the mainboard. I have no issues with this part. When I add in another usb mcu, stm32h723, for Boxturtle AFC I will get timer too close shutdowns on filament homing moves typically right as the system starts looking for the status of my toolhead filament sensor(located on the rp2040 toolhead board). I have as yet not ruled out a usb cable issue however the system does work when using Happy Hare software which is more macro based vs AFC which is more python based. klippy (64).log

This doesn't seem to be a clocksync issue, the connection to the AFC mcu seems to drop packets. The logs show retransmissions (bytes_retransmit=685) and discrepancies between send_seq and receive_seq for the AFC MCU, which indicates dropped or delayed packets.

MCU 'AFC' shutdown: Timer too close
clocksync state: mcu_freq=400000000 last_clock=111883991052 clock_est=(493.659 100105001940 400006818.573) min_half_rtt=0.000117 min_rtt_time=304.502 time_avg=493.659(847.524) clock_avg=100105001940.840(339015564916.959) pred_variance=79436434.483 clock_adj=(-0.025 399995478.105)
Dumping serial stats: bytes_write=35809 bytes_read=37465 bytes_retransmit=685 bytes_invalid=0 send_seq=1482 receive_seq=1470 retransmit_seq=1481 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=103 upcoming_bytes=0

mfeldheim avatar Dec 05 '24 14:12 mfeldheim

That one is more likely to be crappy usb bandwdith and trying to run too many usb devices at the same time, even more so if a camera is also in play. The error being correct for that one, MCU 'AFC' shutdown: Timer too close This often indicates the host computer is overloaded

JamesH1978 avatar Dec 05 '24 14:12 JamesH1978

Here is a log file of the error occuring.. After a log rollover, homing was attempted and the error occurred.

klippy_after error.log

Your MCU communication looks flawless to be honest. Not sure what's the issue here but it's not clocksync-related imho. What's your Z-homing sensor? Does slightly increasing TRSYNC_TIMEOUT e.g. to 0.030 change anything for you?

RTT below TRSYNC_TIMEOUT Smoothened Roundtrip Time 0.001 is well within default TRSYNC_TIMEOUT of 0.025

Stable Clocksync Parameters srtt=0.001, rttvar=0.000, rto=0.025: The round-trip time (RTT) and its variance are extremely low and consistent, indicating a stable communication link. The MCU clock frequency (freq) shows a minor adjustment from 72000323 to 72000380, but this is negligible and within normal bounds.

No Communication Errors: bytes_retransmit=9, bytes_invalid=0: Retransmissions remain unchanged, suggesting no packet loss or corruption during this period. send_seq and receive_seq are in sync, indicating no desynchronization between the host and MCU.

Clocksync Adjustments: The frequency adjustment (freq=72000380) is very slight and not indicative of a significant clocksync issue. The overall sync appears stable with no evidence of excessive jitter or drift.

mfeldheim avatar Dec 05 '24 14:12 mfeldheim

This is a qidi q1 "rooted" to armbian bookworm and using up to date klipper 12. It uses a host/ mainboard combo from MKS based on the MKS skipr. Host is a cortex a53 1.5ghz. Main mcu is stm32f407. The toolhead is a rp2040 connected via usb to the mainboard. I have no issues with this part. When I add in another usb mcu, stm32h723, for Boxturtle AFC I will get timer too close shutdowns on filament homing moves typically right as the system starts looking for the status of my toolhead filament sensor(located on the rp2040 toolhead board). I have as yet not ruled out a usb cable issue however the system does work when using Happy Hare software which is more macro based vs AFC which is more python based. klippy (64).log

This doesn't seem to be a clocksync issue, the connection to the AFC mcu seems to drop packets. The logs show retransmissions (bytes_retransmit=685) and discrepancies between send_seq and receive_seq for the AFC MCU, which indicates dropped or delayed packets.

MCU 'AFC' shutdown: Timer too close
clocksync state: mcu_freq=400000000 last_clock=111883991052 clock_est=(493.659 100105001940 400006818.573) min_half_rtt=0.000117 min_rtt_time=304.502 time_avg=493.659(847.524) clock_avg=100105001940.840(339015564916.959) pred_variance=79436434.483 clock_adj=(-0.025 399995478.105)
Dumping serial stats: bytes_write=35809 bytes_read=37465 bytes_retransmit=685 bytes_invalid=0 send_seq=1482 receive_seq=1470 retransmit_seq=1481 srtt=0.000 rttvar=0.000 rto=0.025 ready_bytes=103 upcoming_bytes=0

Thank you. Maybe the usb system is fine when using Happy Hair software because it uses less bandwidth. I'll try hardware solutions like changing the usb cable and removing the camera from the usb to see what happens.

53Aries avatar Dec 05 '24 16:12 53Aries

Here is a log file of the error occuring.. After a log rollover, homing was attempted and the error occurred. klippy_after error.log

Your MCU communication looks flawless to be honest. Not sure what's the issue here but it's not clocksync-related imho. What's your Z-homing sensor? Does slightly increasing TRSYNC_TIMEOUT e.g. to 0.030 change anything for you?

RTT below TRSYNC_TIMEOUT Smoothened Roundtrip Time 0.001 is well within default TRSYNC_TIMEOUT of 0.025

Stable Clocksync Parameters srtt=0.001, rttvar=0.000, rto=0.025: The round-trip time (RTT) and its variance are extremely low and consistent, indicating a stable communication link. The MCU clock frequency (freq) shows a minor adjustment from 72000323 to 72000380, but this is negligible and within normal bounds.

No Communication Errors: bytes_retransmit=9, bytes_invalid=0: Retransmissions remain unchanged, suggesting no packet loss or corruption during this period. send_seq and receive_seq are in sync, indicating no desynchronization between the host and MCU.

Clocksync Adjustments: The frequency adjustment (freq=72000380) is very slight and not indicative of a significant clocksync issue. The overall sync appears stable with no evidence of excessive jitter or drift.

Thank you for your in-depth analysis. It is greatly appreciated.

I can say that was the firs thing I tried. An increase of TRSYNC_TIMEOUT to 0.30 did not result in any improvement. I did not however step up increments of 0.05 and went strait to 0.50, which works. I would just prefer to solve if possible. The only other function of note with the connection error on these model printers is It happens more consistently when the bed is heated. Temperature of all involved components (excepting the set bed temp) during this time does not exceed 40C.

lanpartyceo avatar Dec 06 '24 14:12 lanpartyceo

I am also having this issue. I have not tried applying this patch as of yet, but will shortly. I have attached my klippy.log to see if anyone can spot an alternate explanation for the disconnects.
klippy.zip

buddybu avatar Dec 06 '24 16:12 buddybu

Hi,

I looked at the three patches here and unfortunately they do not look correct to me.

  1. The first change adds redundant code and should not make a functional difference. As Arksine points out, the reactor already schedules a new update time based on the return value of a callback.
  2. The second change would introduce a regression. It could lead to sporadic "Lost communication with MCU" errors. The self.queries_pending variable is used to track the number of consecutive times that an mcu has failed to respond to a get_clock request - the change here would result in it tracking the total number of missed get_clock requests - which could introduce failures on long running connections.
  3. The third change appears to contain whitespace, comment changes, and variable renaming that do not appear to make a functional difference.

If there is a defect in this low-level part of the code, it would help if you could first describe what the defect is.

It would also help if any fixes are not mixed with whitespace/comment/style changes as that makes it easier to review (mentioned briefly in point 6 at https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review ).

-Kevin

KevinOConnor avatar Dec 06 '24 19:12 KevinOConnor

Thank you for your contribution to Klipper. Unfortunately, a reviewer has not assigned themselves to this GitHub Pull Request. All Pull Requests are reviewed before merging, and a reviewer will need to volunteer. Further information is available at: https://www.klipper3d.org/CONTRIBUTING.html

There are some steps that you can take now:

  1. Perform a self-review of your Pull Request by following the steps at: https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review If you have completed a self-review, be sure to state the results of that self-review explicitly in the Pull Request comments. A reviewer is more likely to participate if the bulk of a review has already been completed.
  2. Consider opening a topic on the Klipper Discourse server to discuss this work. The Discourse server is a good place to discuss development ideas and to engage users interested in testing. Reviewers are more likely to prioritize Pull Requests with an active community of users.
  3. Consider helping out reviewers by reviewing other Klipper Pull Requests. Taking the time to perform a careful and detailed review of others work is appreciated. Regular contributors are more likely to prioritize the contributions of other regular contributors.

Unfortunately, if a reviewer does not assign themselves to this GitHub Pull Request then it will be automatically closed. If this happens, then it is a good idea to move further discussion to the Klipper Discourse server. Reviewers can reach out on that forum to let you know if they are interested and when they are available.

Best regards, ~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.

github-actions[bot] avatar Dec 23 '24 12:12 github-actions[bot]