Open-Source-Ventilator

Program / EEPROM memory integrity check library

Open ermtl opened this issue 4 years ago • 13 comments

Sometimes, usually after a few years (or sooner if the hardware is defective), a processor's program memory or its EEPROM starts to lose its content, sometimes in elusive ways that depend on voltage and/or temperature. When that happens, the program will behave erratically, hang, or fail to perform its intended function. The same can happen if the EEPROM is corrupted.

In a project where the machine's proper functioning is critical, this can't be allowed to happen.

Newer AVR processors (and others) offer hardware program / EEPROM memory integrity check.

A complete description can be found in the ATMEGA4809 datasheet starting at page 370.

For processors that don't have it, the check can be done in software. The basic idea is to scan the whole program memory and calculate a CRC. The computed CRC is compared to several copies of the previously calculated value stored in multiple locations in the EEPROM. If the check fails, the program acts accordingly (generally it should just write an error code to EEPROM and stop). Upon the next start, an error message should be displayed to alert the user that the machine could be unreliable. The same can be done for the EEPROM contents (as a separate check, since it has to be recomputed after each EEPROM update).
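A minimal sketch of the software check described above, written as plain C so it can be tried on a PC. It is not taken from any existing library: the names `crc16_ccitt` and `image_is_intact` are hypothetical, the "image" is an ordinary byte buffer standing in for program memory, and the rule of accepting the image when a majority of the stored CRC copies match is an assumption made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/*
 * Hypothetical helper: recompute the CRC of the image and compare it with
 * several stored reference copies. Assumption: the image is accepted only if
 * more than half of the copies match, so one damaged copy does not cause a
 * false alarm, while a damaged image matches none of them.
 */
bool image_is_intact(const uint8_t *image, size_t len,
                     const uint16_t *stored_copies, size_t copies)
{
    uint16_t computed = crc16_ccitt(image, len);
    size_t matches = 0;
    for (size_t i = 0; i < copies; i++) {
        if (stored_copies[i] == computed) {
            matches++;
        }
    }
    return matches * 2 > copies;
}
```

On a real AVR the buffer would be replaced by reads from flash, and the reference copies would live in EEPROM; both parts are left out here because they depend on the target.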

I could not find a library that implements this. Would anyone be interested in programming one (as a separate, independent project)?

ermtl avatar Mar 25 '20 14:03 ermtl

EEPROM corruption is probably the most likely failure to happen, since the MCU could shut down while writing to the EEPROM and corrupt it.

However, here's what is said in the ATMEGA328P's datasheet about data retention:

Reliability qualification results show that the projected data retention failure rate is much less than 1 PPM over 20 years at 85°C or 100 years at 25°C.

Since this project is meant for emergency use and the MCU shouldn't be in service for 20 years, we can expect even fewer than 1 PPM to fail. It would have been nice of Atmel/Microchip to give us a graph of the failure rate in PPM as a function of age and temperature.

ZakCodes avatar Mar 25 '20 14:03 ZakCodes

I recall discussions on the Arduino forum where people tested the Arduino's EEPROM and found it failing only at up to 1M write cycles. The spec is 100k cycles. So I wouldn't worry too much. If power fails, there is a more immediate problem than what happens when the Arduino reboots.

Blimpyway avatar Mar 25 '20 14:03 Blimpyway

A blank EEPROM is all 0xFF, I think. A power failure exactly in the few milliseconds the Arduino needs to update a value is very unlikely.

Blimpyway avatar Mar 25 '20 14:03 Blimpyway

Most Arduino projects are controlling gizmos that blink. This one is about people's lives. A bug that results in overpressure in the lungs can kill the patient, or damage their lungs and leave them short of breath for life. If you were the patient, would you take "I wouldn't worry too much" for an answer? That's why automotive and medical designs require these kinds of checks.

ermtl avatar Mar 25 '20 14:03 ermtl

@Blimpyway I agree. It is really unlikely, especially considering that data is only written to the EEPROM when a user changes the configuration. However, it shouldn't be too hard to compute a CRC of the EEPROM every time we write to it, so I think it's worth it if it can prevent an accident.

ZakCodes avatar Mar 25 '20 14:03 ZakCodes

OK. Simply XOR-ing (or even adding) 255 32-bit integer chunks and writing the result in the 256th should be more than enough. On reboot, XOR again; if the result does not match, just sound an alarm and do nothing. Whoever operates it will have to dial in all the settings again. If that is OK with you, I'll write it.

Blimpyway avatar Mar 25 '20 14:03 Blimpyway

CRCs exist for a reason: they are much more robust than a basic XOR. When memory corruption occurs, it often comes in clusters. If 2 bytes are affected, a simple XOR check may not see it. That's why CRC16 or CRC32 is needed, and the check has to be done on the program memory as well as the EEPROM.
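To illustrate the difference, here is a small host-side sketch (hypothetical helper names `xor32` and `crc32_bytes`, not code from this project). It flips the same bit in two different 32-bit words: the two flips cancel out in a plain XOR sum, while a standard CRC-32 (reflected polynomial 0xEDB88320) changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain XOR of all 32-bit words, as in the simple checksum proposal. */
uint32_t xor32(const uint32_t *words, size_t count)
{
    uint32_t sum = 0;
    assert(words != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        sum ^= words[i];
    }
    return sum;
}

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF). */
uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

For example, with `uint32_t settings[8]`, applying `settings[2] ^= 0x100; settings[5] ^= 0x100;` leaves `xor32(settings, 8)` unchanged, whereas `crc32_bytes((const uint8_t *)settings, sizeof settings)` returns a different value.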

The mindset when developing safety critical devices is very different from the maker mindset where the job is considered done as soon as the main function kind of works. Here, we must be sure that it will always work as intended.

https://en.wikipedia.org/wiki/Cyclic_redundancy_check
https://en.wikipedia.org/wiki/Hamming_distance

ermtl avatar Mar 25 '20 15:03 ermtl

failtest.zip

Here's a zip file with:

  • ee_failsafe.ino which can be added as a tab to your project.
  • failsafe.ino - just to show how the two functions eeprom_write_crc() and eeprom_test_crc() should be used.

And yes, it's using a slightly modified CRC32 sum I picked from the arduino.cc examples. The change just skips the EEPROM memory slot where the CRC itself needs to be saved. This address can be redefined in ee_failsafe.ino, e.g. #define SKIP_ADDRESS 16
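For readers who don't open the zip, the idea can be sketched roughly as follows. This is not the content of ee_failsafe.ino: it is a host-side approximation in which a RAM array `eeprom[]` stands in for the real EEPROM, the size constants are assumptions, and `demo_write_crc()` / `demo_test_crc()` are hypothetical stand-ins for `eeprom_write_crc()` and `eeprom_test_crc()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EEPROM_SIZE  1024u  /* assumption: size of the simulated EEPROM */
#define SKIP_ADDRESS 16u    /* slot where the CRC itself is stored */
#define CRC_SIZE     4u     /* a CRC-32 occupies four bytes */

uint8_t eeprom[EEPROM_SIZE]; /* RAM stand-in for the real EEPROM */

/* CRC-32 over the whole array, skipping the four bytes that hold the CRC. */
uint32_t eeprom_crc_skipping_slot(void)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t addr = 0; addr < EEPROM_SIZE; addr++) {
        if (addr >= SKIP_ADDRESS && addr < SKIP_ADDRESS + CRC_SIZE) {
            continue; /* the stored CRC must not influence its own value */
        }
        crc ^= eeprom[addr];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}

/* Call after every settings update: store the fresh CRC, least significant byte first. */
void demo_write_crc(void)
{
    assert(SKIP_ADDRESS + CRC_SIZE <= EEPROM_SIZE);
    uint32_t crc = eeprom_crc_skipping_slot();
    for (size_t i = 0; i < CRC_SIZE; i++) {
        eeprom[SKIP_ADDRESS + i] = (uint8_t)(crc >> (8u * i));
    }
}

/* Call at boot: returns true if the stored CRC still matches the contents. */
bool demo_test_crc(void)
{
    uint32_t stored = 0;
    for (size_t i = 0; i < CRC_SIZE; i++) {
        stored |= (uint32_t)eeprom[SKIP_ADDRESS + i] << (8u * i);
    }
    return stored == eeprom_crc_skipping_slot();
}
```

The intended usage is to call the write function after each change to the settings and the test function once at startup, refusing to run and raising an alarm if it returns false.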

Blimpyway avatar Mar 25 '20 15:03 Blimpyway

Feel free to ask me if anything isn't clear.

Blimpyway avatar Mar 25 '20 15:03 Blimpyway

If we're sharing thoughts about critical-systems guidelines, then people's lives should not depend on software/mechanics developed in <4 weeks, no matter how good the designers are.

Blimpyway avatar Mar 25 '20 15:03 Blimpyway

That was fast! I'll test it tonight. Can you also add the ability to scan the program memory? The test should only cover memory up to the end of the program; otherwise, leftovers from previously written programs will appear as random garbage and the value will not be the same from one device to the next.

I agree with you about the short time, unfortunately we don't get to decide about that, that's why it's an emergency. Normally such stuff takes months if not years for a whole team to develop.

ermtl avatar Mar 25 '20 16:03 ermtl

If you really want it to be safe, approach it functionally. I mean, have a function that checks that correct pressure swings are registered, and reset the WDT in the same place. Medics already want to be alerted if pressure gets too low or too high. Add an alert for pressure not changing. This covers pressure sensor / I2C failure, air leakage, an exploded bag, an unplugged stepper, and any other part of the code not working right. If pressure swings are recorded at the mouth of the patient, then chances are s/he is not dead yet. If sensor contamination is acceptable (it might be autoclavable, since it is made to withstand soldering), it can even sense exhalation temperature or, at least, tell that the patient isn't entirely cold.
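A rough sketch of what such a functional check could look like, in plain C. Everything here is an assumption for illustration: the names, the pressure unit (tenths of cmH2O), and the thresholds `MIN_SWING` and `MAX_WINDOW_MS`, which are placeholders and not clinically validated values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_SWING     50     /* hypothetical: 5.0 cmH2O, in tenths of cmH2O */
#define MAX_WINDOW_MS 6000u  /* hypothetical: longest time allowed without a swing */

typedef enum { SWING_PENDING, SWING_OK, SWING_STALLED } swing_status_t;

typedef struct {
    int16_t  min_seen;        /* lowest pressure in the current window */
    int16_t  max_seen;        /* highest pressure in the current window */
    uint32_t window_start_ms; /* time at which the current window began */
} swing_monitor_t;

void swing_init(swing_monitor_t *m, int16_t pressure, uint32_t now_ms)
{
    assert(m != NULL);
    m->min_seen = pressure;
    m->max_seen = pressure;
    m->window_start_ms = now_ms;
}

/*
 * Feed one pressure sample. Returns SWING_OK when a full swing has just been
 * registered (the caller would reset the watchdog here), SWING_STALLED when
 * the window expired without one (raise the alarm), SWING_PENDING otherwise.
 */
swing_status_t swing_update(swing_monitor_t *m, int16_t pressure, uint32_t now_ms)
{
    assert(m != NULL);
    if (pressure < m->min_seen) { m->min_seen = pressure; }
    if (pressure > m->max_seen) { m->max_seen = pressure; }

    if ((int32_t)m->max_seen - (int32_t)m->min_seen >= MIN_SWING) {
        swing_init(m, pressure, now_ms); /* start a fresh window */
        return SWING_OK;
    }
    /* Unsigned subtraction stays correct when the millisecond counter wraps. */
    if ((uint32_t)(now_ms - m->window_start_ms) > MAX_WINDOW_MS) {
        return SWING_STALLED;
    }
    return SWING_PENDING;
}
```

On an AVR, the `SWING_OK` branch is where a call such as `wdt_reset()` from `<avr/wdt.h>` would go, so that the watchdog only stays quiet while real pressure swings keep arriving.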

Blimpyway avatar Mar 27 '20 01:03 Blimpyway

One solution to the EEPROM write problem is a soft-shutdown circuit, so that the MCU itself can shut down the whole system.

A second precaution is to save EEPROM values in multiple locations at the same time, meaning each value is saved in three EEPROM locations. On reading, read all three locations and check whether they are the same.

A third one is to add separate safeguard circuits for critical things. Hope it helps.
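The second precaution above can be sketched as a 2-out-of-3 vote. This is only an illustration: the function name `vote3` is hypothetical, and accepting the value when any two copies agree (instead of requiring all three to match) is an assumption made here so that a single corrupted copy can be tolerated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare three stored copies of one byte. If at least two agree, write the
 * agreed value to *out and return true. If all three differ, return false so
 * the caller can raise an alarm and ask for the settings to be re-entered.
 */
bool vote3(uint8_t a, uint8_t b, uint8_t c, uint8_t *out)
{
    assert(out != NULL);
    if (a == b || a == c) {
        *out = a;
        return true;
    }
    if (b == c) {
        *out = b;
        return true;
    }
    return false;
}
```

When the vote succeeds but one copy disagrees, the caller could also rewrite that copy with the agreed value to repair it.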

mawaisbadri avatar Mar 28 '20 16:03 mawaisbadri