klipper Bulk ADC Sensors

Add Bulk ADC Sensors and Load Cell skeleton

Add support for the ADS1220, HX717 and HX711 ADC sensors.
Add a load_cell skeleton to create the sensors and start taking samples.

Is this Bug Free / Reliable?

Versions of the HX711/HX717 code have been running for weeks at a time on my test system. Others have tested this code as well and we are pretty confident this version doesn't suffer form any timing defects.

The ADS1220 code is newer but has been running for at least a week non-stop without an error. Since its SPI based it should be more reliable. The only quirk with this chip is that a reset is required before configuring it.

The only issue I'm aware of is with _finish_measurements() running when the printer experiences a SHUTDOWN. This causes the communication to the MCU to fail and logs an exception message. This is confusing for users as its at the end of the log but not the root cause of the shutdown. All of the other Bulk Sensors share this behavior. It would be nice if we could detect a shutdown in progress and skip the serial command in _finish_measurements().

Build Changes

Bulk Sensor assumes that a sensor will be SPI or I2C. Currently it would be automatically excluded on small platforms that don't have these interfaces. But the HX71x chip family uses bit-banging and can run on anything with GPIO. So I altered the build to allow the HX71x chips to be turned off. If they are included they also bring in Bulk Sensor. I'm not sure which platforms we might want to exclude this on or if it was worth bothering about.

What doesn't this do?

The implementation of a digital scale is cut from this first PR to keep the size small. The plan is to put that code in the next PR.
These sensors are all "multiplex" ADCs, meaning they have a single measurement core that is connected to a multiplexer that allows the input channel to be switched at runtime. But they can only produce 1 measurement at a time and some delay is required to switch channels (often a couple of measurement cycles). This code does not make this feature usable, in fact for the ADS1220 I didn't even include an option to select an input channel. This feature has been used by Prusa in their MK4 and XL products. They use the 'A' channel on the HX717 to real the load cell and the 'B' channel to read the the analog filament sensor. They switch inputs at runtime at a regular duty cycle and rout the measurements to separate components. I decided this is out of scope due to the complexity it would create.

Apr 08 '24 04:04 garethky

Looks like the build failed due to the increased binary size on small platforms. I'll have to do an update to exclude the HX71x on those.

Edit 1: looks like only the AR100 checks the size? Edit 2: looks like it should also be excluded on platforms that don't WANT_GPIO_BITBANGING

Apr 08 '24 16:04 garethky

Not sure I follow how that is relevant to this PR?

Apr 10 '24 19:04 garethky

Not at all, this was supposed to go into my Pr. Sorry :-(

Apr 10 '24 23:04 eazrael

I have a busy few weeks coming up and I'm not sure when I'm going to get to all the comments. If I can grab a few minutes here and there ill knock one out and just tack on a commit. When its all done we can squash git history into 2 commits as requested.

I'm not sure how I'm going to prove the sensor reset stuff works. I cant force an error on command without adding extra code. I might put in a counter that forces a reset from the MCU side every 30 seconds or something and then take that our for the submission. Will probably save this for last.

Apr 17 '24 18:04 garethky

Thanks. I'm not sure what the "sensor reset" stuff is.

There were a bunch of stuff discussed above, but putting aside various code commentary, I'd say only the following are "blockers":

This PR should be rebased onto the latest Klipper master (which contains the bulksensor.py FixedFreqReader refactor).
The mcu bulk sensor code should not call shutdown(). It should check for errors and propagate those errors to the host code.
Don't modify setup_query_command() in bulk_sensor.py.
The mcu "pending flag" code didn't look correct to me. Either I'm missing something or the code should be corrected. (This may already be done.)
The code should be split into 2 commits, but I can do that if it's a pain for you to do that.

-Kevin

Apr 22 '24 23:04 KevinOConnor

I'm not sure what the "sensor reset" stuff is. Well we need to agree on this so I know what to do.

Looking at the accelerometer code, it seems to ignore the errors. I don't see where AccelCommandHelper or ResonanceTester actually check if the data contains errors. This "carry on" approach can't work for load cells.

Here is a primer on the problem:

These chips fail at runtime due to ESD. The big metal load cell acts like a ground path and they are very susceptible to static. This causes the chip to reboot, erasing any settings it had from klippy startup. Continuing to sample after an error wont recover the sensor. Even if it does produce data it is going to be the wrong channel/sample rate/gain etc. So once the chip resets the data stream is dead.

Some chips in this class send a reboot flag with their measurements but these chips are too cheap for that. So all of these checks are heuristics to detect a reset. Even taking too long to collect a sample can change the settings on the HX71x so the next sample is garbage.

When an error is detected we have to stop sampling, re-program the chip and restart sampling. I avoided doing this because its complex. and when installed in a properly grounded printer I don't see these errors. But I'm sure at a large scale they will happen at some non zero rate.

I agree with you that sending the errors to the Host is the right thing to do. But if all the Host does it shutdown on error its a marginal improvement. Ideally I'd like to try and recover so that a stray ESD doesn't ruin a print.

Apr 28 '24 22:04 garethky

Thanks. I can only give high-level comments as I'm not familiar with the details of these sensors - so take my comments with "a grain of salt".

I don't see where AccelCommandHelper or ResonanceTester actually check if the data contains errors

I was referring to how src/sensor_adxl345.c:adxl_query() checks for a read error and populates a response of 0xff 0xff 0xff 0xff 0xff, and how klippy/extras/adxl345.py:_convert_samples() looks for that code and drops the sample (and increments the _process_batch() error counter). Also see how klippy/extras/ldc1612.py:_convert_samples() checks for warning flags in the top 4 bits of the sample, increments the "errors" counter, but continues to process the data.

When an error is detected we have to stop sampling, re-program the chip and restart sampling. I avoided doing this because its complex. and when installed in a properly grounded printer I don't see these errors.

For what it is worth, as a user, I'd not want any automatic error recovery like that. To me, restarting the sensor sounds like it is very likely to lead to overall poor measurement results. It may be possible to restart the sensor, but there is a very high chance the resulting values will not be of high quality - the nozzle/bed contact may have been missed or distorted, the timing correlation is likely off, etc. So, it may be possible to restart the sensor, but there is a very good chance I'll get a terrible first level layer height. As a users, I'd prefer the action to be stopped and to receive an error that says "your hardware is malfunctioning and needs to be fixed", so that I can then fix the hardware and go on to get great first layer quality.

As a developer, I would also prefer not to maintain complex code that, to wit, "makes it easy for users to get bad results".

So, my concern is not with reporting of errors, it is with using shutdown() to convey all errors. Shutdown is useful for internal errors (to alert developers of a coding error), for safety issues (heater outside temperature range), and for situations where the mcu is hopelessly unable to perform the requested actions (eg, very far behind on processing of scheduled timed actions). For a malfunctioning sensor, we ideally want to stop the action (eg, "sorry we couldn't perform that probe because the load-cell sensor is reporting a ground fault"), but ideally we don't want to completely turn off the motors/heaters with an arcane diagnostic message (eg, "ADS1220: Possible bad read").

Again, I can't give definitive advice on how to handle errors for these sensors. It seems to me though, that if it's a questionable sensor reading we would want to mark it as a questionable reading and let the high-level code determine what action to take. If the sensor gets in a state where it can't produce any readings, then we can signal that as well (so that the high-level code knows it wont get further readings and can raise an appropriate error message).

For this initial PR of sensor support, with a goal of getting sensor data to the api server, a shutdown() on questionable data seems a drastic response.

Thanks again. Hope that helps. -Kevin

Apr 29 '24 17:04 KevinOConnor

Thanks, I think I can work with that guidance.

As a users, I'd prefer the action to be stopped and to receive an error that says "your hardware is malfunctioning and needs to be fixed", so that I can then fix the hardware and go on to get great first layer quality.

This is going to happen via the LoadCellEndstop. When the sensor stops reporting data during a homing event, its a critical failure. The LoadCellEndstop has a watchdog timer and if no data is received for 2 sample windows it will issue a SHUTDOWN. (not in this PR)

I think an axis moving without a functioning endstop counts as a SHUTDOWN "for safety issues", this is just like a thermistor becoming unplugged.

We don't have to shutdown in the sensor code to achieve that, so they will be removed from this PR.

As a developer, I would also prefer not to maintain complex code that, to wit, "makes it easy for users to get bad results".

I also don't want to implement this due to the complexity. (Users should ground their printers!!)

The next best thing is to just stop measurements when the Host gets an error from the sensor. So inside the sensor it will not re-schedule the sampling timer on error. On the Host, I can flush the error to clients and then run the finish measurements logic which will disconnect all of the consumers from bulk_sensor and run the shutdown logic on the MCU sensor. (I don't think bulk_sensor informs consumers that they were disconnected, but in any case they will get the error first)

For something non-critical like a Filament scale it should be possible to restart measurements without halting the print. Or just let the filament scale be broken until the Host restarts. Ideally a non-critical use shouldn't ruin a print.

Apr 29 '24 17:04 garethky

I think an axis moving without a functioning endstop counts as a SHUTDOWN "for safety issues", this is just like a thermistor becoming unplugged.

A bit off topic here, but the best way to rapidly stop a homing/probing operation is to issue a trsync_do_trigger() with an error code as the "reason". It's faster than shutdown() for multi-mcu homing (as it is propagated between mcus without entering the Python code), and it also allows for better error reporting.

-Kevin

Apr 29 '24 19:04 KevinOConnor

It's faster than shutdown() for multi-mcu homing (as it is propagated between mcus without entering the Python code), and it also allows for better error reporting.

ahh good to know. Will get that fixed up.

Apr 30 '24 17:04 garethky

I got the C part in, shutdowns are gone. When an error happens the sensor just turns off its pending flag and flushes the error. That will be the last packet it sends before its restarted (if its ever restarted).

I used the possible_overflows fields to send back an error code, since that free payload that these sensors don't use.

I have to do the re-base to mainline before I can tackle the Python side

May 01 '24 03:05 garethky

I rebased and now the ROM for stm32f042 is too large. How are we deciding what to cut from these space limited chips?

May 09 '24 05:05 garethky

How are we deciding what to cut from these space limited chips?

There's no documented strategy - generally we disable whatever new feature caused the build to go over in size. It's really useful to do build all the targets that we have config files for - it catches lots of inadvertent errors. There isn't a need for these smaller chips to support every possible option though.

-Kevin

May 13 '24 00:05 KevinOConnor

rn both revisions fail on imports :(

should I start worrying and trying to finish this pr, or am I missing something obvious?)

May 19 '24 12:05 cab404

rn both revisions fail on imports :(

should I start worrying and trying to finish this pr, or am I missing something obvious?) Sorry, is this comment related to this PR (Bulk ADC Sensors)?

May 19 '24 21:05 garethky

rn both revisions fail on imports :(

should I start worrying and trying to finish this pr, or am I missing something obvious?) Sorry, is this comment related to this PR (Bulk ADC Sensors)?

Yes, and I mean both commits in the branch

May 19 '24 23:05 cab404

Yes, and I mean both commits in the branch

I just need to get some time to work on it. Looks like I have to exclude the hx71x sensor from all the places that would support it but don't have enough space left in their image to do so. I don't have the local tooling on my Mac setup to test this so I'm doing commits and having github do the testing.

May 20 '24 05:05 garethky

I don't want to rush you, sorry if that sounded like it — I just wanted to ask if any help is needed.

Thank you for working on this!

May 20 '24 10:05 cab404

I don't want to rush you, sorry if that sounded like it — I just wanted to ask if any help is needed.

Thank you for working on this!

I don't have a way to compile and run the github test suite on my Mac. I looked at some setups but couldn't make it work.

May 23 '24 01:05 garethky

I don't know the actual rom size limits for boards. But just as a "crazy" suggestion, maybe in the future. Technically, ROM can store klipper.dict in any format. Json compressed is simplest of course.

For now, the simplest way to squish more is to use a different compression algorithm or use dictionary compression. Just for reference:

$ du -bs klipper.dict*      
7741    klipper.dict # size on master branch
2442    klipper.dict.br # google brotli (uses text dictionary internally)
2839    klipper.dict.gz # zlib compression -9 
2986    klipper.dict.zst # Zstd default
1148    klipper.zstd.dict # Zstd with dictionary trained over current klipper.dict

May 23 '24 01:05 nefelim4ag

stm32f401 - "Too many message ids"

So this PR can't be merged until klipper gets 2 bytes message IDs (or these sensors are excluded from all tests, which seems wrong)

I'll keep working on the other outstanding issues

May 23 '24 02:05 garethky

I don't have a way to compile and run the github test suite on my Mac. I looked at some setups but couldn't make it work.

What Mac do you have?

May 23 '24 10:05 cab404

What Mac do you have? Intel

May 24 '24 16:05 garethky

Last word I know about this from Kevin: https://klipper.discourse.group/t/scalability-of-too-many-message-ids/8023/4

"Using a VLQ for the command id was just never implemented in the mcu. Implementing it is likely the best long-term scalable solution."

May 26 '24 18:05 garethky

Intel

Hm, and this tool doesn't work? https://github.com/nektos/act

May 27 '24 13:05 cab404

I made a change to the HX71x code to disable interrupts for the entire read cycle. This should last about ~15us. I made the change because of 2 errors that I saw happen while printing:

"Time too close" while bed meshing. These would usually happen in the first couple of probe points. Multiple people have reported this.
ERROR_READY_AFTER_READ happening randomly while printing, indicating that the read was delayed long enough for a new sample to arrive.

Combined with seeing ERROR_READ_TOO_LONG when the MAX_READ_TIME was set to 200us, its pretty clear that the code is being interrupted for a long time. This seems to happen when the serial queue sends instructions to the step queue. The code change seem to resolve the issue but I'm concerned about the increased time with interrupts turned off.

I was trying to follow what neopixel does and only disable interrupts for a minimum of time. So whats reasonable to turn interrupts off for?

May 31 '24 20:05 garethky

I think I have a good solution now:

I'm measuring the start and end times outside of taking the sample and assuming they include IRQ interruptions that extend the execution time. So I've limited the execution time to 1/2 the sample period. This seems to be "fair" as no single module should be taking up 50% of the wall clock time to get executed.
I moved the check for the data ready pin inside an IRQ fence so its checked at the exact time we expect the chip to turn it off. Checking it any later is fragile as IRQ delay may cause it to be sampled after it comes back on.
I've gone back to the delay between clock pulses using irq_poll(). This tries to limit the time the IRQs are turned off to the absolute minimum. This is what neopixel and prusa_buddy are doing and was the original intent.

I haven't see the "Timer too close" error come back but I've only had 2 days to test this.

I've implemented rudimentary sensor resets when an error is reported. They are run from the host side. The sensors stop scheduling themselves when they report an error and flush whatever readings they have to the host. The host also now logs an error when it sees one of the error codes from the sensor. Later the load_cell can cause a shutdown if it sees an error in the middle of a probe. I haven't tested the reset stuff because I cant really cause it at will, but I will get to that soon, probably with a temporary count based trigger in the sensor code.

And I've split the code up into 2 commits. One for the basic load_cell skeleton and hx71x and a second adding the ADS1220.

Jun 03 '24 05:06 garethky

Intel

Hm, and this tool doesn't work? https://github.com/nektos/act

I tries and it didn't work, build ended with an error related to building the doc site. I didn't have time to do any debugging.

Jun 04 '24 20:06 garethky

I merged in the patch from Frederic Lauzon [email protected] (I think frlauzon here on github) that makes msg_id a VLQ. I need this to build my code now that we have run out of single byte message ids.

Jun 05 '24 16:06 garethky

2-byte message ID support is on the way from @KevinOConnor: https://github.com/Klipper3d/klipper/pull/6613 Will re-base once this is merged.

Jun 06 '24 19:06 garethky