Repetier-Firmware icon indicating copy to clipboard operation
Repetier-Firmware copied to clipboard

dev2 - Z-probe sometimes ignored - crash into bed

Open kyrreaa opened this issue 4 years ago • 35 comments

During large probing sessions the back move is apparantly not executed correctly and motion continues to bed. Popping probe up to inactive state. (In this state it reports H as if it was triggered.) Instead of aborting and backing off and reporting an error, the fw continues to grind into bed.

If I deactivate the probe before a G30 to test it jiggles about and aborts with message to host about error, so this works as expected.

I have also noticed that some of the back-moves after probe hit is done at wrong speed and I suspect this is related.

kyrreaa avatar Mar 01 '20 15:03 kyrreaa

There are some magic numbers in the code "Motion1::stopMask = 4; // z trigger is finished" that seem odd. It basically refers to A_AXIS and not ZPROBE_AXIS?

kyrreaa avatar Mar 01 '20 15:03 kyrreaa

Mask is a bit field, so 1 = x, 2 = y, 4 = z, 8 = E, 16 =a, ...

What probe type are you using? In other thread you said type 2 but that is nozzle as probe which has no offset to activate. There only position is moved to valid region = bed minus safety border.

I guess you use type 1 = default zprobe or 3 = BLtouch.

repetier avatar Mar 01 '20 16:03 repetier

Yes, sorry. I use type 1. This kind of looks like race condition or interrupt issue to me. I find it odd though that it happens very rarely. If it was an instance of buffered state not being updated (classic volatile stuff) then it would happen all the time.

There is the possibility though that the previous move to location triggers the endstop to snap to disabled and that this happens at the exact point the code switches over to probing. Since it checks the endstop only once before recalculating the path needed for the probe it could possibly end up not seeing the transition as it queues the move and starts to wait? If the transition happens before endstopMode = EndstopMode::PROBING would it not be ignored?

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

Not sure I can follow here. Looking at float ZProbeHandler::runProbe() I see that it return without any move if probe is already triggered. Inside the for loop it does measurements and always moves back again to start position, so will in theory never stop at bottom. Then it is updated again.

One problem are zprobes where high signal is only short like BLtouch. That is why we have that as special case to test also fro probe error state. That is the only probe I have for testing at the moment. Hope I can test soon - printer is not 100% functional at the moment a sit is a delta and I also need to check delta function working in parallel. Anyhow in this code segment

        // Go up to tPos2
        FOR_ALL_AXES(i) {
            int32_t df = tPosSteps2[i] - tPosSteps3[i];
            corSteps2[i] = ((df > 0) ? 1 : ((df < 0) ? -1 : 0)) * (labs(df) - Motion1::stepsRemaining[i]);
        }
        Motion1::moveRelativeBySteps(corSteps2);
        Motion1::waitForEndOfMoves();

The head should always move back. What is your Z_PROBE_REPETITIONS value? On repeated moves it does not move full distance back and there could be miscount.

BTW: What is your test script you are running here? Just to understand the condition under which the problem happens.

repetier avatar Mar 01 '20 17:03 repetier

I am running bed map from Repetier Host with 20x20 points. I added some code in the runProbe to investigate. I moved the code in waitForEndOfMoves out into a local while loop and added tests for !Motion1::axesTriggered && ZProbe->update(). This will run at loop speed with any free cpu resources as the system spins anyways. Thus no impact on performance. I then Print a warning that there has been a pin change isr miss on probe and count iterations of loop that sees the same state. After 3 iterations I print a new warning about aborting and abort the loop. I do see my message printed every 10 probes or so, but I do not see the second message. My theory is that the first message can come every once in a while due to timing conditions but since the pin change should have higher priority than main it should not be many iterations of delay.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

I only do single probes. No repeated with averages.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

My understanding is that only updateReal() in EndstopSwitchHardwareDriver will update correctly. Calling update() should just read the state set by the updateReal(). Since this appears not to always end the iterations I can assume that somehow the state change is registered before the status to actually update/probe is added.

My suggestion is to check the flag again before waiting, but after setting the states to probe.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

I am trying to add an error message right before the moveByPrinter() to catch this. If it triggers then it would confirm something like this is happening. Hopefully I can fix it before it breaks my "one of a kind" hotend.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

What endstop type are you using? I currently always use ENDSTOP_SWITCH_HW which triggers an interrupt and stop move directly. The polling solution always calls update() to get updated state on each step.

updateReal() is special case for hardware endstop. That is the function called by processor on signal change and updates state for later use in update(). Looking into it you would need to have the end stop for z probe defined as zaxis, false (dir -1) so it stops motion

          if (axis >= 0 && newState) { // tell motion planner
                endstopTriggered(axis, dir);
            }

as that is the only way it would tell it if you have #define NO_SOFTWARE_AXIS_ENDSTOPS to speed up endstop testing. If it does not stop then it would move the z to the end loosing steps and steps remaining is 0 and does go up to higher position.

repetier avatar Mar 01 '20 17:03 repetier

Nope. Did not catch it... It crashed into bed again. This time I simplified my test. Just a positioning of head to a x,y coordinate and running a script with 200 probes. G30 P1 G30 P0 G30 P0 ... G30 P2

It just kept dinking into the same spot untill it suddently kept on going and crashed into bed. This pops the probe up as I mentioned so it will stay at high level. Still it did not stop. I confirmed it is sending high after popping. Adding the check for probe state in the loop instead of waitForEndOfMoves() seemed to work. I will have to retest to confirm but it did not appear to crash when testing there. I could be just lucky though.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

And yes, I use ENDSTOP_SWITCH_HW and that is the issue I think. Somehow it does not stop the move.

kyrreaa avatar Mar 01 '20 17:03 kyrreaa

Can you post your 2 configurations. Guess it might be important which exact settings you have here. You could also try without pushing endstops and use polling instead to see if that is the problem. Will test as soon as my time allows. Depending on what it is it could also happen to homing where the same technique to stop is used.

repetier avatar Mar 01 '20 18:03 repetier

I guess I don't home as often as I probe so statistics is in my favour there. I can push to my fork so it can be accessed without having to fill the issue up with files.

kyrreaa avatar Mar 01 '20 18:03 kyrreaa

Also great, just give me the link then and I check there.

repetier avatar Mar 01 '20 18:03 repetier

https://github.com/kyrreaa/Repetier-Firmware/commits/kyrre_dev2_corexyz There you go.

kyrreaa avatar Mar 01 '20 18:03 kyrreaa

I will switch the z probe to polled and compare.

kyrreaa avatar Mar 01 '20 18:03 kyrreaa

Wow. Well, changing the z probe to a simple polled probe makes it just ignore it all together. M119 shows it is changing states properly but it is being ignored. It also behaves more correctly on the homing part as it does not appear to rehome x and y before probing when it is a sw probe.

Edit: I ofcource forgot to remove NO_SOFTWARE_AXIS_ENDSTOPS.

Edit 2: Nope. Still ignores it...

kyrreaa avatar Mar 01 '20 18:03 kyrreaa

I think I found something. In WInterrupts.c the pin change interrupts are initialized with NVIC_SetPriority(PIOD_IRQn, 0); in my case for the endstop input on Ramps-FD I use as probe. In fact all ports are inited to level 0. I use Zmin btw, which is PIN D26 aka GPIOD.1 but that is irrelevant when it does so with all... It is therefore possible for the pin change to happen and interrupt the MOTION3 level as it is configured for DUE in HAL.cpp as level 2.

kyrreaa avatar Mar 01 '20 20:03 kyrreaa

Could it be as simple as motion trying to move to next segment of travel while being interrupted by pin-change which sets flags to abort in the active segment. The active segment is then discarded when motion resumes and it can no longer see it should have stopped?

kyrreaa avatar Mar 01 '20 20:03 kyrreaa

WInterrupts.c is part of the core arduino framework by the way, so a workaround or safeguard would be needed in motion...

kyrreaa avatar Mar 01 '20 20:03 kyrreaa

I am using quite slow accelleration so it is possible that the move is split into multiple segments to handle the accelleration and the hig happens just as it runs activateNext()? This would allow the probe hit to affect the old segment and then upon continue the new segment replaces it.

kyrreaa avatar Mar 01 '20 20:03 kyrreaa

Well... I thought I could bypass it by removing #define NO_SOFTWARE_AXIS_ENDSTOPS and letting the code check as normal. This did not work. I expect the issue has to do with paths, segments and timing so I made a new testscript that changes to relative mode (g91) and increases z by 0.01mm (g0 z0.01) before changing back to absolute mode (g90) between each probe (g30 p0). The script starts with probe almost touching and a "g30 p1". After reaching about the height I use to probe over my slightly warped bed it crashed.

kyrreaa avatar Mar 01 '20 21:03 kyrreaa

Motion3 should have highest priority, but I saw it was set to 3 in hal.cpp. Changing it to NVIC_SetPriority((IRQn_Type)MOTION3_TIMER_IRQ, 0); // highest priority - no surprises here wanted should solve that problem with priority.

Looking into code I think skipParentId would be the crucial variable. It gets set to force Motion3 to finish line, but only activateNext changes it if the line is finished meaning z move is executed. So even if we missed the check it means 1 extra step until move is popped.

I have tested it 100 times on my delta G1 P1 + 18x G30 P0 then G30 P2 no errors. That is on a stm32F4 board where interrupt priority is 0. So not sure if it priority, other speed or just luck. Best you try with motion3 priority 0 and see if it improves.

I also got no shift of position, but maybe I need to move xy in between to get that error.

repetier avatar Mar 02 '20 14:03 repetier

I think I discovered the reason for the shift. Basically I had z home height set to 0 which disables any move avter homing, but I had tool change z lift to 2mm which tried to move z up by 2mm after homing. Since it was already at Zmax height (after backing off the endstop 2mm) it could not move up and so reset the machine homed and said it cannot be trusted. The rehoming done for probing rehomes x and y but not z so it is kind of odd. Changing the z home height to within the work envelope fixed this.

As for the priority, pin change interrupts appear to have higher priority than timer interrupts so even at equal priority ranking the pin change will win. Solution is to make sure the operation in timer3 is safe.

kyrreaa avatar Mar 02 '20 14:03 kyrreaa

The reproduction is very dependant on machine accelleration I think. I made my script like this:

` G30 P1

G91 G0 Z0.01 G92 G30 P0

G91 G0 Z0.01 G92 G30 P0

G91 G0 Z0.01 G92 G30 P2 `

The middle two repetitions would then be copied a LOT of times. (2-400) and it would probe same spot over and over. Starting with probe almost touching. Mine has accelleration set to 500mm/s^2 and it fails at around 3mm. It may also be dependant on the segment count or such if exists... Deltas are different from linear machines there I guess.

I also want to repeat that it did not work even if I removed the NO_SOFTWARE_AXIS_ENDSTOPS which could mean the polled method is vulnerable too.

kyrreaa avatar Mar 02 '20 14:03 kyrreaa

There is an error in MotionLevel3.cpp line 166

        if (testMotorId == Z_AXIS && Motion1::endstopMode == PROBING) {
            if (ZProbe->update()) { // ignore z endstop here
                Motion2::endstopTriggered(act, act->directions & axisBit, false);
            }
        }

I think it should be

      if (Motion1::endstopMode == PROBING) {
           if (ZProbe->update()) { // ignore z endstop here
               Motion2::endstopTriggered(act, Z_AXIS, false);
           }
       }

Explains at least why polling for z probe does not work.

Regarding interrupt - you are right as I understand the lowest vector is the tie braker here. So adding

    NVIC_SetPriority(PIOA_IRQn, 1);
    NVIC_SetPriority(PIOB_IRQn, 1);
    NVIC_SetPriority(PIOC_IRQn, 1);
    NVIC_SetPriority(PIOD_IRQn, 1);

to void HAL::setupTimer() should help here. When called it blocks interrupts

void endstopTriggered(fast8_t axis, bool dir) {
    InterruptProtectedBlock noInt;
    Motion2::endstopTriggered(Motion3::act, axis, dir);
}

so we only need to prevent calling it while in motion3, also I still do not see how this can cause the side effect but it makes sense regarding the frequency it happens.

Would be great if you can recheck. Will also try tomorrow if I can get it with my setup - but different processor alone could already prevent it from happening. Maybe I should test on my felix also it has other z probe type, but same timings.

repetier avatar Mar 02 '20 18:03 repetier

Very cool. I've been avoiding the bed map so far due to this and focusing on the tuning of the retracts etc. Later I will run direct extrusion but I need to design the extruder first...

Here's a clip showing Repetier Firmware (with slight patch) run a real Core-XYZ! https://youtu.be/B8z4qNUmVlg

kyrreaa avatar Mar 02 '20 18:03 kyrreaa

Hey! Back from a little absence. Ah... I had this same exact problem a while back ago but I had assumed it may have just been something I broke. I realize now I should have probably made a github issue about it, apologies. I don't have a full solution, but I have a workaround from what I last traced down. Hopefully this might help narrowing the bug down or at least echos what's been posted so far.

But one first bug:

I have also noticed that some of the back-moves after probe hit is done at wrong speed and I suspect this is related.

I'm not entirely sure if it's related, though I had this issue as well.

In MotionLevel1.cpp line 784, in the function: Motion1::moveRelativeBySteps(int32_t coords[NUM_AXES])

buf.length would occasionally already have something it was set to, affecting the planner's speed. Clearing it to zero just as startSpeed and endSpeed are appears to fix that issue. buf.startSpeed = buf.endSpeed = buf.length = 0;

As for the main bug, endstopTriggered seems to be called properly, however the Motion3::act pointer is occasionally null so the function doesn't execute further. It appeared to be a sort of race condition between interrupts but I never got to pin it down fully at the time.

My fix was to attempt to skip to the next act via Motion3::activateNext() if it's a null pointer and we're also probing. A bit dirty but fixed it for the time being. The vast majority of the time this actually works and obtains the correct next act, and I can usually get a 400 point bed map to work. But in some seemingly super rare occurrences it'll still fail, in which case I kill the printer rather than have a bed crash.

In MotionLevel2.cpp, line 557

void Motion2::endstopTriggered(Motion3Buffer* act, fast8_t axis, bool dir) {
    if (act == nullptr || act->checkEndstops == false) {
        // Moses - Hotfix for failing probe endstop
        // For some reason sometimes the motion3buffer is empty/null when endstopTriggered is called,
        // leading to a failed z probe and then a bed crash.
        if (Motion1::endstopMode == EndstopMode::PROBING) {
            Motion3::activateNext();
            act = Motion3::act;
            //endstop failed hit
            DEBUG_MSG_FAST("ESTF");
            if (act == nullptr) {
                //skipped buffer failed
                DEBUG_MSG_FAST("SKBF");
                HAL::delayMilliseconds(2000);
                Printer::kill(0);
                return;
            } else {
                DEBUG_MSG_FAST("ATT");
            } 
        } else {
            //failed act
            DEBUG_MSG_FAST("FNACT");
            return;
        }
    }
...

Regarding interrupt - you are right as I understand the lowest vector is the tie braker here. So adding... I've just tried adding this bit and also updating motion 3's priority in HAL, but unfortunately the problem still persists. I wish I would have dug into this a little further when I first caught it.

However:

There is an error in MotionLevel3.cpp line 166

Thanks! er, this fixes another issue I had assumed was just my configuration/HAL modifcations... software Z endstop works correctly now.

And now: using software endstops seems to eliminate the problem, possibly confirming it's interrupt related.

--

A side note: Once the bed probing problem was fixed, another issue I had was that distortion was being applied in "reverse". My printer's bed physically moves downwards whenever I "raise" the Z axis. I fixed it by changing instances of addDistortion to subDistortion (MotionLevel1.cpp:832, MotionLevel2.cpp:221). I'm assuming this'll be configurable in the future though?

AbsoluteCatalyst avatar Mar 04 '20 05:03 AbsoluteCatalyst

I've just found it odd how we map from probe level to bed with probe. Makes the maps created by Repetier Host look odd and inverted. I would prefer a positive number to mean ved rises abowe z0 and a negative to be below. Thus probing would not change result based on height it was started from.

kyrreaa avatar Mar 04 '20 14:03 kyrreaa

endstopTriggered

eat post. Regarding buf.length you are 100% right. Not only occasionally it is not zero but it just has the length 16 moves ago and then accumulates to that, so setting to 0 is a must.

Regarding act = 0 is also a good hint. That in deed would lead to the hardware interrupt not triggering end stop. And at the end of each motion3 block you have: // Test if we are finished if (--act->stepsRemaining == 0) { if (act->last) { Motion2::pop(); } InterruptProtectedBlock noInts; --length; act = nullptr; } it would only get set back at the next start of motion3 and as @kyrreaa pointed out the interrupt priority would allow it already to interrupt there, but more likely it happens in the time between the 2 motion3 steps so even improved priority handling (which is still good) would not solve it fully. So in total I'm quite sure we now have the reason it happens. Correct solution is an adoption of your idea - not kill printer but run activateNext() if length is remaining. Plus we need to ensure interrupt priority is ok.

Just testing it and fixing delta problems as well to make G32 complete, so I can test the distrotion problem. No need to choose - it just should do it correct:-)

repetier avatar Mar 04 '20 14:03 repetier