klipper klippy: add stallguard monitoring tool

Hi, as https://github.com/Klipper3d/klipper/pull/6139 is merged, I can now show the StallGuard measurement work. I'm not sure that polling without batch processing is the ideal solution, and I'm not sure about the timings/frequency. I tried to mimic the ADXL345 code to avoid adding any new complexity. My discourse thread with more graphs, and some comments.

So, here is the problem that I'm trying to solve: CoolStep is simple, yet hard to configure, because the user has no near-real-time visibility into it. The original TMC solution is to directly reproduce and measure your load. We can't see the StallGuard pin output outside of homing actions to track stalls, but we can poll it.

*And there are many different printers, and half of them are now CoreXY, which makes things tricky here.

High-level solution:

Monitor the driver status.
Store history with stallguard/speed/current data.
Make it human-readable.

For now, I added the command: MEASURE_STALLGUARD STEPPER=<stepper>. The first call starts polling/filtering/storing data, and the second call writes the data to a CSV file.

cat /tmp/stepper_y-20240515_024227.csv 
#time,velocity,sg_result,csactual
43668.927338,191.3,67,30
43668.952901,203.8,67,28
43668.977954,203.8,67,28
43669.003832,203.8,67,28
43669.363363,203.8,74,27
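For anyone who wants to post-process such a dump by hand, here is a minimal sketch of a loader. It only assumes the column layout shown in the sample above; the function name `load_stallguard_csv` is made up for illustration and is not part of the PR.

```python
import csv
import io


def load_stallguard_csv(fileobj):
    """Parse a dump with a '#time,velocity,sg_result,csactual' header.

    Hypothetical helper: returns a list of
    (time, velocity, sg_result, cs_actual) tuples.
    """
    samples = []
    for row in csv.reader(fileobj):
        # Skip blank lines and the '#'-prefixed header line
        if not row or row[0].startswith("#"):
            continue
        time, velocity, sg_result, cs_actual = row
        samples.append(
            (float(time), float(velocity), int(sg_result), int(cs_actual))
        )
    return samples


# A few rows copied from the sample output above
SAMPLE = """#time,velocity,sg_result,csactual
43668.927338,191.3,67,30
43668.952901,203.8,67,28
43669.363363,203.8,74,27
"""

samples = load_stallguard_csv(io.StringIO(SAMPLE))
print(len(samples), samples[0])
```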

There is a dump script for data processing:

scripts/calibrate_stallguard.py /tmp/stepper_y-20240512_010858.csv -o ~/klipper/stepper_y-20240512_010858.png

Example output for one hour print with CoolStep enabled, unfiltered, cumulative graph.

~/klipper/scripts/calibrate_stallguard.py stepper_y-20240515_004650.csv -o stepper_y-20240515_004650_cumulative_unfiltered.png -e 100000

As you can see, there is crazy noise at low velocities, and there are also crazy speeds calculated from step timing, possibly because of step compression.

Let's filter that, and analyze it together. (min 1 mm/s, max 200 mm/s)

~/klipper/scripts/calibrate_stallguard.py stepper_y-20240515_004650.csv -o stepper_y-20240515_004650_cumulative_low.png -s 1 -e 200

There is StallGuard noise below ~35 mm/s: not enough back EMF. The suggested SG range for StallGuard2 is 0..100, assuming a more or less correct SGT (-10..0..10) value (mine is 1 or 2, depending on the motor). We can track the min & mean SG values until they stabilize.

Min becomes zero at speeds above ~125 mm/s (> 3 RPS).
Mean becomes acceptable slightly earlier, at ~80 mm/s (2 RPS).

*CS is 31 on TMC5160 with a reasonable sense resistor and equal CS max here. The minimum threshold for CoolStep is 32, because SEMIN must be > 0: with SEMIN = 1, SG = SEMIN * 32.
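The min/mean tracking per velocity range can be sketched roughly like this. It is not the code from calibrate_stallguard.py; the function names, the 10 mm/s bin width, and the demo samples are invented for illustration. The rotation distance of 40 mm is only an assumption inferred from "80 mm/s is about 2 RPS".

```python
from collections import defaultdict


def sg_stats_by_velocity(samples, v_min=1.0, v_max=200.0, bin_width=10.0):
    """Group (velocity_mm_s, sg_result) pairs into velocity bins.

    Returns {bin_start: (min_sg, mean_sg)} for samples inside
    [v_min, v_max]; everything else is filtered out.
    """
    bins = defaultdict(list)
    for velocity, sg_result in samples:
        if not (v_min <= velocity <= v_max):
            continue
        bin_start = int(velocity // bin_width) * bin_width
        bins[bin_start].append(sg_result)
    return {
        start: (min(values), sum(values) / len(values))
        for start, values in sorted(bins.items())
    }


def mm_s_to_rps(velocity, rotation_distance=40.0):
    """Convert linear speed to motor revolutions per second.

    rotation_distance=40 mm is an assumed value, not from the PR.
    """
    return velocity / rotation_distance


# Invented demo data: (velocity in mm/s, sg_result)
demo = [(0.5, 300), (82.0, 60), (85.0, 40), (130.0, 0), (500.0, 10)]
stats = sg_stats_by_velocity(demo)
print(stats)
print(mm_s_to_rps(80.0))
```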

So, I picked 150 mm/s as the threshold, with a safe margin, and I know my printer is limited to 1000 mm/s. (Let's just ignore high-speed diagonal moves here.)

So, here is the last one, with an adequate range.

~/klipper/scripts/calibrate_stallguard.py stepper_y-20240515_004650.csv -o stepper_y-20240515_004650_cumulative.png -s 130 -e 1000

Mean SG is around ~60. If SG is above 64, the current can be reduced (with SEMIN=1, SEMAX=0 the upper threshold is (SEMIN + SEMAX + 1) * 32 = 64). If it is below 32, the current should be bumped up.
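The threshold arithmetic above can be written out as a small sketch. The function names are hypothetical, and treating "equal to the upper threshold" as a decrease is my reading of the datasheet wording, so take the boundary handling as an assumption.

```python
def coolstep_sg_window(semin, semax):
    """Return (lower, upper) SG thresholds for given SEMIN/SEMAX.

    lower = SEMIN * 32, upper = (SEMIN + SEMAX + 1) * 32,
    as in the formulas quoted above.
    """
    if semin <= 0:
        # SEMIN must be > 0 for CoolStep to be active
        raise ValueError("SEMIN must be > 0")
    return semin * 32, (semin + semax + 1) * 32


def coolstep_action(sg_result, semin=1, semax=0):
    """Classify what CoolStep would do for one SG reading."""
    lower, upper = coolstep_sg_window(semin, semax)
    if sg_result < lower:
        return "increase"  # load too high, bump the current up
    if sg_result >= upper:  # boundary handling is an assumption
        return "decrease"  # plenty of margin, current can go down
    return "hold"


print(coolstep_sg_window(1, 0))
print(coolstep_action(60), coolstep_action(70), coolstep_action(20))
```

With a mean SG of ~60 and SEMIN=1, SEMAX=0, most readings land in the "hold" window between 32 and 64.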

There we are: with CoolStep enabled, the mean CS is around 25 and the min goes down to 16, because of SEIMIN=0 (50%). 25/31 * 100 = 80%, i.e. a 20% current reduction on medium-velocity moves.

There is a missing part: I didn't test the "high" velocities where back EMF is too big, because they are too high for my printer. So, according to the graph, I can suggest that assuming half of the supply voltage as back EMF is a safe margin, because up to 1000 mm/s there is still an adequate response from CoolStep in my test.

My 2 cents: I have only TMC5160 48V & TMC2209, so only this combination was tested. The motors are LDO 2504AC. I initially tuned the SGT values with sensorless homing. So far I have no issues with CoolStep, with almost default values.

The CoolStep threshold will control sensorless homing and, IIRC, the sensorless homing speed must be above that threshold. On my printer, sensorless homing works fine at 50 mm/s, but CoolStep with such a low threshold and with the allowed reduction SEIMIN=1 (25% of CS) leads to skipped steps.

With an extruder, CoolStep can work, but its usefulness really depends on the setup and extrusion flow. On an Orbiter 2, the flow must be above 25 mm³/s to get 2 RPS.

Will be happy to hear any feedback, even if these things are useless. Thanks!

Raw data: stepper_y-20240515_004650.csv

May 07 '24 14:05 nefelim4ag

Thank you for submitting a PR. Please refer to point 3 in "What to expect in a review" in https://github.com/Klipper3d/klipper/blob/master/docs/CONTRIBUTING.md and provide a Signed-off-by line.

Thanks James

May 07 '24 15:05 JamesH1978

I will update that branch with the measurement stuff, as the current changes are superseded by: https://github.com/Klipper3d/klipper/pull/6139

And I will fix CI warnings of course.

May 15 '24 00:05 nefelim4ag

Thanks. I have a few high-level comments:

I can see where it would be useful to gather this information, and it seems like a useful feature.
I think it would be preferable to add this to the existing "motan" system instead of creating a new set of custom text files and graphing tools. That is, it would be preferable to use bulk_sensor.BatchBulkHelper() to gather the samples and write them out over the API server. You can take a look at how motion_report.py uses this helper to gather and report results - adxl345.py is another example. You can also look at scripts/motan/ and https://www.klipper3d.org/Debugging.html#motion-analysis-and-data-logging for information on analyzing the data. Using the "motan" system allows for cross-analysis with motor movements and other sensors.
The cs_actual is a good catch and makes sense to change.
Minor, but I think it would be preferable to add this new logic to a new class in tmc.py (eg, TMCStallguardDump()) instead of adding it to TMCCommandHelper(). You can take a look at how TMCCommandHelper() instantiates TMCErrorCheck() as an example.
Also minor, but I don't think it is necessary to change all the tmcxxxx.py code to activate this feature - the TMCStallguardDump() code can self activate whenever there is a cs_actual field present (or some other appropriate field).
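The "self activate when a cs_actual field is present" idea could look roughly like the standalone sketch below. It does not use the real Klipper classes: the `{register: {field: mask}}` dict layout, the trimmed field tables, and the `StallguardDumpSketch` name are all stand-ins chosen for illustration.

```python
def find_register(all_fields, field_name):
    """Return the register that holds field_name, or None.

    all_fields is assumed to be a {register: {field: mask}} dict.
    """
    for reg_name, fields in all_fields.items():
        if field_name in fields:
            return reg_name
    return None


# Hypothetical, heavily trimmed field tables for two drivers
FIELDS_WITH_CS = {
    "GCONF": {"en_pwm_mode": 1 << 2},
    "DRV_STATUS": {"sg_result": 0x3FF, "cs_actual": 0x1F << 16},
}
FIELDS_WITHOUT_CS = {
    "GCONF": {"en_pwm_mode": 1 << 2},
}


class StallguardDumpSketch:
    """Stand-in for a separate dump class that enables itself."""

    def __init__(self, all_fields):
        # No per-driver opt-in: activate only if the field exists
        self.register = find_register(all_fields, "cs_actual")
        self.enabled = self.register is not None


print(StallguardDumpSketch(FIELDS_WITH_CS).register)
print(StallguardDumpSketch(FIELDS_WITHOUT_CS).enabled)
```

The point of the design is that the per-driver modules stay untouched; the helper inspects the field table it is handed and decides on its own.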

Cheers, -Kevin

May 20 '24 00:05 KevinOConnor

FYI, I committed the csactual -> cs_actual change to the master branch (as well as correcting the size on tmc5160) - commit b6a00632.

-Kevin

May 22 '24 00:05 KevinOConnor

bulk_sensor.BatchBulkHelper() is reused; the API endpoint names/format are up to you (I just set something). BTW, I initially wrongly assumed it was coupled to the sensor bulk MCU command, so I tried to do magic around SPI, unsuccessfully. :)
Thanks, rebased
I also think so - moved to a separate class.
yep, done.

Currently, I'm in motan, trying to make some useful graphs with this. I feel like simply returning the value leaves most of the graph empty/zero.

    def pull_data(self, req_time):
        jmsg = self.jdispatch.pull_msg(req_time, self.name)
        ....
        if jmsg is None:
            return
        time, velocity, sg_result, cs_actual = data[0]
        if self.filter == "sg_result":
            return sg_result

P.S. We can actually return sg_result/cs_actual at a standstill, but sg_result would be meaningless. cs_actual can show IHOLD at work, and maybe loss of microstep precision when paired with an encoder?

May 22 '24 00:05 nefelim4ag

Not sure, but as a guess the code is confusing systimes with printtimes ( https://www.klipper3d.org/Code_Overview.html#time ). The motan system is expecting all the times to be print_times.
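To illustrate the distinction, here is a minimal sketch of converting a host system time into a print time, assuming a simple linear clock model (a reference sample plus an estimated MCU frequency). The function and parameter names are hypothetical and the numbers are invented; this is not the actual Klipper clock sync code.

```python
def systime_to_print_time(systime, sample_time, sample_clock,
                          est_freq, mcu_freq):
    """Convert a host system time to a print time.

    Assumed model: the MCU clock advances at est_freq ticks per
    host second from a known (sample_time, sample_clock) pair,
    and print time is the MCU clock divided by the nominal
    MCU frequency.
    """
    # Estimate the MCU clock at the given host time
    clock = sample_clock + (systime - sample_time) * est_freq
    # Print time is expressed in seconds of MCU clock
    return clock / mcu_freq


# Invented numbers: a 72 MHz MCU whose clock read 72,000,000
# at host time 1000.0 s
print_time = systime_to_print_time(
    systime=1000.5,
    sample_time=1000.0,
    sample_clock=72_000_000,
    est_freq=72_000_000.0,
    mcu_freq=72_000_000.0,
)
print(print_time)
```

The host time (1000.5) and the resulting print time are on completely different scales, which is why mixing the two produces samples that land far away from the motion data on a graph.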

-Kevin

May 22 '24 01:05 KevinOConnor

So, more or less done. I'm not sure that I'm getting/calculating the print time correctly; maybe estimated_print_time() is better.

It feels like the perfect solution is to get the SPI answer clock timing and then convert it to print_time somehow, but that looks like too much work, I guess.

Thanks.

Looks like a shift of ~250 ms =\ Maybe I need to take a buffer time into account, like in the toolhead.py code.

May 22 '24 22:05 nefelim4ag

The estimated print time is better overall here; now the shift is around 15 ms. In my setup, a TMC query takes around 1 ms.

But I have no idea how to get better accuracy here.

May 23 '24 17:05 nefelim4ag

Thank you for your contribution to Klipper. Unfortunately, a reviewer has not assigned themselves to this GitHub Pull Request. All Pull Requests are reviewed before merging, and a reviewer will need to volunteer. Further information is available at: https://www.klipper3d.org/CONTRIBUTING.html

There are some steps that you can take now:

Perform a self-review of your Pull Request by following the steps at: https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review If you have completed a self-review, be sure to state the results of that self-review explicitly in the Pull Request comments. A reviewer is more likely to participate if the bulk of a review has already been completed.
Consider opening a topic on the Klipper Discourse server to discuss this work. The Discourse server is a good place to discuss development ideas and to engage users interested in testing. Reviewers are more likely to prioritize Pull Requests with an active community of users.
Consider helping out reviewers by reviewing other Klipper Pull Requests. Taking the time to perform a careful and detailed review of others' work is appreciated. Regular contributors are more likely to prioritize the contributions of other regular contributors.

Unfortunately, if a reviewer does not assign themselves to this GitHub Pull Request then it will be automatically closed. If this happens, then it is a good idea to move further discussion to the Klipper Discourse server. Reviewers can reach out on that forum to let you know if they are interested and when they are available.

Best regards, ~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.

Jun 07 '24 00:06 github-actions[bot]

Thanks.

I fear I may not have understood the goal of this PR when I first reviewed it.

Is your goal, 1) to merge a tool for configuring coolstep, 2) to merge support for analyzing tmc drivers for motion analysis, 3) to not merge but discuss possible changes, or 4) something else?

If the goal is to merge a tool to configure coolstep, then I think the tool would need to be tested by a number of interested users, and have feedback showing the resulting calibration improves real-world results for a notable audience.

If the goal is to enhance motion analysis, then I think the PR should be streamlined for that goal. In particular, I'd avoid adding a MEASURE_STALLGUARD command, avoid the custom data file format, and avoid adding the calibrate_stallguard.py script. It may also be worthwhile to consider dumping the raw tmc fields (eg, DRV_STATUS, SG_RESULT) to the output instead of processing it in the main klipper process (that is, let the user, or motan scripts, post process the fields of interest).
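As a rough idea of what post-processing a raw field could look like on the analysis side, here is a sketch that decodes a few fields from a raw DRV_STATUS value. The bit positions follow my reading of the TMC2130/TMC5160 DRV_STATUS layout; other drivers differ, so treat the table as an assumption. The function name and the sample value are invented.

```python
# (shift, mask) per field; assumed TMC2130/TMC5160 layout
DRV_STATUS_FIELDS = {
    "sg_result": (0, 0x3FF),   # bits 9..0
    "cs_actual": (16, 0x1F),   # bits 20..16
    "stallguard": (24, 0x1),   # bit 24
    "stst": (31, 0x1),         # bit 31, standstill indicator
}


def decode_drv_status(raw):
    """Split a raw 32-bit DRV_STATUS value into named fields."""
    return {
        name: (raw >> shift) & mask
        for name, (shift, mask) in DRV_STATUS_FIELDS.items()
    }


# Invented raw value: cs_actual=25, sg_result=60, motor moving
raw = (25 << 16) | 60
decoded = decode_drv_status(raw)
print(decoded)
```

Dumping the raw register keeps the main Klipper process simple, at the cost of needing a per-driver table like the one above in the analysis scripts.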

"But I have no idea how to get better accuracy here."

I'm not sure there is a way to get good timing accuracy with the tmc drivers. The current code seems to query three fields (TSTEP, DRV_STATUS, SG_RESULT) which means three round-trip times between host -> mcu -> driver. Inherently that's going to be difficult to get good timing with. I'm also not sure why you want TSTEP as we already know how frequently Klipper sends steps to the driver (but maybe I'm missing something).

That said, the most accurate timestamp is likely to pull out the params['#receive_time'] from the low-level self.spi.spi_transfer() (or self.tmcuart_send_cmd.send()) request. Converting the systime to eventtime (as you are doing) is likely a lot easier though, and likely only marginally less accurate.

Cheers, -Kevin

Jun 08 '24 00:06 KevinOConnor

TLDR: Something in between. Let's focus on motan.

My initial goal was to get working CoolStep thresholds (superseded by: https://github.com/Klipper3d/klipper/pull/6139). My second goal is to add some tooling to work with it. The TMC datasheet says to monitor the "live" SG value, etc., and that is just impossible to do right now; that is what I initially tried to do. That said, I'm flexible here. Motan & the batch logger are a more powerful/flexible solution, and that is implemented.

Of course, I will be glad if someone is interested and gives it a try, but so far there is no interest in my 3D printing circles. I agree with not adding a new command and tooling, even if it feels simpler. The motan interface is more complex, of course, but superior. IIRC they are in separate commits, so they can easily be dropped/skipped. I should recheck and squash some commits, of course; they are here for the history of the war against timing precision.

About raw fields: they may be slightly different between drivers. 2130 & 5160 have slightly different numbers of fields in status. SG_RESULT exists in 2209, but not in 2130/5160; 2240 uses both.

It feels very complex to post-process, no? As it is a "low"-frequency dump, maybe just dump them parsed, as a dictionary?

If you decide to go for raw values, I think I should rename the code to something like "tmc dump" or "tmc status", because the current stallguard_dump name would be misleading.

TSTEP was initially used because it is the driver's direct sense of step frequency. After all, I had no way to dump the raw speed of a motor, and TSTEP allows matching it against thresholds.
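For reference, a TSTEP reading can be turned back into a linear speed roughly as in this sketch. It relies on TSTEP being the time between two 1/256 microsteps in driver clock cycles; the 12 MHz driver clock, the 40 mm rotation distance, and the 200 full steps per revolution are assumed example values, and the function name is made up.

```python
TSTEP_MAX = 0xFFFFF  # 20-bit register, saturates at standstill


def tstep_to_velocity(tstep, rotation_distance=40.0,
                      full_steps=200, tmc_freq=12_000_000.0):
    """Estimate stepper speed in mm/s from a TSTEP reading.

    All default parameters are assumed example values.
    """
    if tstep <= 0 or tstep >= TSTEP_MAX:
        # Invalid or saturated reading: treat as not moving
        return 0.0
    # Distance travelled per 1/256 microstep
    step_dist_256 = rotation_distance / (full_steps * 256)
    # tmc_freq / tstep is the 1/256-microstep rate in Hz
    return tmc_freq * step_dist_256 / tstep


print(round(tstep_to_velocity(75), 3))
print(tstep_to_velocity(TSTEP_MAX))
```

Because the same relation is used to program the velocity thresholds, a speed derived this way can be compared against them directly.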

With motan:

I don't know if there is a way to show stepper speed for CoreXY. But it can be calculated of course.
as stealth threshold for 2240?
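Calculating the per-stepper speed on CoreXY from the toolhead velocity could look like the sketch below. It assumes the common convention where one stepper follows x + y and the other follows x - y; the function name is hypothetical.

```python
def corexy_stepper_velocities(vx, vy):
    """Return signed (stepper_a, stepper_b) speeds in mm/s.

    Assumed kinematics: a = x + y, b = x - y, so the same
    relation holds for the velocities.
    """
    return vx + vy, vx - vy


# A pure X move drives both steppers at the toolhead speed
print(corexy_stepper_velocities(100.0, 0.0))
# A diagonal move loads one stepper and leaves the other idle
print(corexy_stepper_velocities(100.0, 100.0))
```

This is why diagonal moves are awkward for threshold tuning on CoreXY: one motor can be at standstill while the other runs faster than the toolhead speed along either axis.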

But I also feel it is wasteful in some way to collect TSTEP.

Yep, it looks like passing the low-level data through the TMC code is a way, but I didn't have enough courage to pass it through. So, if it is fine for now, we can leave it as is.

Jun 08 '24 20:06 nefelim4ag