Failed to claim receiving frame 13: expected state `Sent`, but got `None` (or `RxProcessing`, or `Sendable`)

Open 1cedsoda opened this issue 9 months ago • 10 comments

Hi, when I run the ek1100 example, the tx_rx_task fails after a while. Sometimes after a couple of seconds, sometimes after a few minutes. The ek1100 example is a reproducible case, but I have hit this and similar errors in my own codebase too. These errors were not introduced in 0.5.4 or 0.5.5; I have seen them before.

Environment

  • Version of ethercrab in use or git hash: 7fbd959c62c31b179d2c54259aee3071d761f162
  • Operating system kind and version: OSX 15.3.1
  • EtherCAT devices in use: EK1100, EL2002, EL3001, EL2521, EL2522

Uploads

Log using ek1100 example:

RUST_LOG=debug cargo run --example ek1100 en6

got None

[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1000 EK1100 inputs: 0 bytes, outputs: 0 bytes
[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1001 EL2002 inputs: 0 bytes, outputs: 1 bytes
[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1002 EL3001 inputs: 2 bytes, outputs: 0 bytes
[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1003 EL2521-0024 inputs: 8 bytes, outputs: 14 bytes
[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1004 EL2522 inputs: 16 bytes, outputs: 28 bytes
[2025-03-21T14:18:17Z INFO  ek1100] -> SubDevice 0x1005 EL2522 inputs: 16 bytes, outputs: 28 bytes
thread 'main' panicked at examples/ek1100.rs:122:40:
TX/RX: Timeout
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2025-03-21T14:19:56Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 4: expected state Sent, but got None
[2025-03-21T14:19:56Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 4

Similar errors I encountered with my codebase:

got RxProcessing

[2025-03-21T12:09:45Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 13: expected state Sent, but got RxProcessing
[2025-03-21T12:09:45Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 13

got Sendable

[2025-03-21T12:16:18Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 10: expected state Sent, but got Sendable
[2025-03-21T12:16:18Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 10

And this one

[2025-03-21T12:11:43Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: failed to decode raw PDU data into type

1cedsoda avatar Mar 21 '25 14:03 1cedsoda

Hey, thanks for the report. For spurious issues like this, please always upload a Wireshark trace if you can capture the failing event. Having the packet logs makes debugging a bit easier on my side :).

Are you able to test the same example on a Linux machine, and if so do you see the same result? I haven't properly tested the macOS network driver in quite some time, so there could be a bug buried in it somewhere.

jamwaffles avatar Mar 21 '25 14:03 jamwaffles

I ran it on Linux, waited 20 minutes, and everything seemed fine. I'm glad it's not that severe.

I reproduced it on Mac with a Wireshark capture. I can't see anything off with the process data just by looking at the frames of the last second.

[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1000 EK1100 inputs: 0 bytes, outputs: 0 bytes
[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1001 EL2002 inputs: 0 bytes, outputs: 1 bytes
[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1002 EL3001 inputs: 4 bytes, outputs: 0 bytes
[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1003 EL2521-0024 inputs: 4 bytes, outputs: 4 bytes
[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1004 EL2522 inputs: 16 bytes, outputs: 28 bytes
[2025-03-21T17:13:35Z INFO  ek1100] -> SubDevice 0x1005 EL2522 inputs: 16 bytes, outputs: 28 bytes
thread 'main' panicked at examples/ek1100.rs:122:40:
TX/RX: Timeout
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2025-03-21T17:21:50Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 4: expected state Sent, but got None
[2025-03-21T17:21:50Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 4

capture3.pcapng.gz

1cedsoda avatar Mar 21 '25 17:03 1cedsoda

Just had a quick look and I see a 57 ms hang between the penultimate received frame and the final sent frame. Not sure if the "failed to claim receiving frame" error is a cause or a symptom of that, but I'll look a bit deeper tomorrow maybe.

In the meantime, you could try bumping this config value and see where you get, although it kinda smells like there's some weirdness going on in macOS.
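
For reference, a minimal sketch of what bumping that timeout could look like, assuming the linked value is the PDU response timeout (`Timeouts::pdu`); the 500 ms figure is purely illustrative:

use std::time::Duration;
use ethercrab::Timeouts;

fn main() {
    // Raise only the PDU response timeout; everything else keeps its default.
    let timeouts = Timeouts {
        pdu: Duration::from_millis(500),
        ..Timeouts::default()
    };

    // ... pass `timeouts` to `MainDevice::new(...)` as usual.
    let _ = timeouts;
}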

By the way, you can add a column called "delta time" in Wireshark that will show network round trip times and hangs between loop iterations. I can't remember if I mentioned this before, but if you're using tokio, I'd recommend switching to the smol executor (and timers, etc) instead as tokio can have issues with latency spikes.
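
If you do try smol, a rough sketch of the swap for the TX/RX task and cycle timer might look like this, assuming the PduStorage setup from the examples and reusing the en6 interface name from your command above; treat it as a starting point rather than a drop-in:

use std::time::Duration;
use ethercrab::{std::tx_rx_task, PduStorage};

const MAX_FRAMES: usize = 16;
const MAX_PDU_DATA: usize = PduStorage::element_size(1100);

static PDU_STORAGE: PduStorage<MAX_FRAMES, MAX_PDU_DATA> = PduStorage::new();

fn main() {
    let (tx, rx, _pdu_loop) = PDU_STORAGE.try_split().expect("can only split once");

    // Drive the TX/RX future on its own thread with smol instead of tokio::spawn.
    std::thread::spawn(move || {
        let tx_rx = tx_rx_task("en6", tx, rx).expect("create TX/RX task");
        smol::block_on(tx_rx).expect("TX/RX task exited with error");
    });

    // Cycle timing with smol's timer instead of tokio::time::interval.
    smol::block_on(async {
        loop {
            // ... tx_rx / process data work with the MainDevice goes here ...
            smol::Timer::after(Duration::from_millis(5)).await;
        }
    });
}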

jamwaffles avatar Mar 21 '25 21:03 jamwaffles

Had a similar issue. I ended up just blindly increasing the timeouts in the MainDevice config. Even then, it still takes a couple of tries running the script before it starts detecting the subdevices and running. Sometimes it works on the first run, sometimes it takes 5 retries. Will try getting a Wireshark trace later.

andergisomon avatar Mar 28 '25 06:03 andergisomon

Another trace would be useful, thanks!

You could also play around with the retry_behaviour config option. Setting it to e.g. Count(5) could mitigate spurious errors. This is still a bandaid, but it's at least another one to have in the medkit :)
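
Roughly, and purely as an illustration of where that option lives:

use ethercrab::{MainDeviceConfig, RetryBehaviour};

fn main() {
    // Retry a failed PDU up to 5 times before surfacing the error.
    let config = MainDeviceConfig {
        retry_behaviour: RetryBehaviour::Count(5),
        ..MainDeviceConfig::default()
    };

    // ... pass `config` to `MainDevice::new(...)` as usual.
    let _ = config;
}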

jamwaffles avatar Mar 28 '25 07:03 jamwaffles

I can also reproduce this occasionally on an M1 MacBook Air using a USB-C Ethernet adapter, running with cargo run --example ek1100 en11 (NOT with --release). I'm checked out at tag ethercrab-v0.5.6.

The root cause will be a race condition in the unsafe (🤫) internals of EtherCrab which will be fun to track down...

jamwaffles avatar Mar 28 '25 20:03 jamwaffles

I noticed that when my Mac (plugged in) went to screensaver, the failed to claim error occurred and the example program stopped. This led me to a method to somewhat reliably reproduce the failure:

  • Change tick rate in ek1100 example to 1ms
  • cargo run --example ek1100 en11 (optionally with --release)
  • Type in screensaver in Spotlight and open the settings window for it
  • The EK1100 example will sometimes crash with Failed to claim receiving frame ..., sometimes panic with a timeout without logging anything, and sometimes keep working fine

All this said, the error itself is a bit of a red herring. When a timeout occurs, some internal state is cleaned up by custom Drop impls, so when a response frame is received while the TX/RX task is still running, it will try to fetch the now-invalid slot that frame belongs to, causing the error.

@1cedsoda For the other errors you posted under "Similar errors I encountered with my codebase", do they all occur after the timeout panic, or before it?

I have a few things to try:

  1. Always run with --release - this mitigates the crashiness a bit
  2. Consider increasing cycle times when running on macOS, using #[cfg()] attributes (see the sketch after this list)
  3. Do the same with Timeouts::pdu
  4. Turn on TX retries with MainDeviceConfig::retry_behaviour
  5. Use the smol executor instead of tokio. This is general advice for anyone seeking better EtherCrab performance really. tokio isn't as consistent as smol and seems to suffer when system load is higher.
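
For point 2, a minimal sketch of the cfg-gating (the durations are placeholders, not recommendations):

use std::time::Duration;

// Slower cycle on macOS, faster elsewhere; adjust the numbers to taste.
#[cfg(target_os = "macos")]
const CYCLE_TIME: Duration = Duration::from_millis(5);
#[cfg(not(target_os = "macos"))]
const CYCLE_TIME: Duration = Duration::from_millis(1);

fn main() {
    println!("cycle time: {:?}", CYCLE_TIME);
}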

If the errors are always after a timeout panic, I'm afraid there's no definitive bug to fix or action to take, so I'll probably close this issue. Let me know how you get on.

jamwaffles avatar Mar 28 '25 21:03 jamwaffles

Can't see any timeout panic here

[2025-04-03T13:48:00Z INFO  server::ethercat::r#loop] Starting control loop
[2025-04-03T13:50:13Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 10: expected state Sent, but got Sendable
[2025-04-03T13:50:13Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 10

I will try the things you proposed.

I will transition to smol, but I have to figure out how to mix it with tokio properly, which I still need for HTTP and SocketIO.

1cedsoda avatar Apr 03 '25 13:04 1cedsoda

Hm. Strange. Are you on v0.6? That might have something magical in it that fixes everything.

As usual, please attach Wireshark captures, full logs and source code if it's easy enough to capture these failures. I can't do much without the full picture. At minimum, the Wireshark capture will tell me if everything is ok on the network side.

Are you sending a lot of traffic with a fast cycle time? If you are, consider increasing the MAX_FRAMES const generic, i.e. this one.
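
For anyone following along, a rough sketch of where that knob lives, assuming the PduStorage setup used in the examples (the numbers are illustrative):

use ethercrab::PduStorage;

// MAX_FRAMES bounds how many PDUs can be in flight at once; raise it if a fast
// cycle plus other traffic exhausts the frame slots.
const MAX_FRAMES: usize = 32;
const MAX_PDU_DATA: usize = PduStorage::element_size(1100);

static PDU_STORAGE: PduStorage<MAX_FRAMES, MAX_PDU_DATA> = PduStorage::new();

fn main() {
    let (_tx, _rx, _pdu_loop) = PDU_STORAGE.try_split().expect("can only split once");
}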

jamwaffles avatar Apr 03 '25 14:04 jamwaffles

Quick update: the problem seems to have disappeared with 0.6.0 (at least for me), although this time I ran the ek1100 example on its own (not from a cloned repo).

ek1100_capture_0.6.0_2025-04-07T09:13+08:00.zip

andergisomon avatar Apr 07 '25 01:04 andergisomon

@1cedsoda I'll close this as resolved unless you're still experiencing the issue. If so, please reopen.

jamwaffles avatar Jul 05 '25 11:07 jamwaffles

Hi @1cedsoda, a few others from our group and I are still struggling with disconnects on our machines. One machine contains roughly 10 subdevices, all connected via two EK1100s wired in series.

EtherCAT connections are managed by ethercrab, which now runs on a real-time thread as suggested. This helped for a few weeks; however, disconnects keep reappearing. We assume that having more devices makes disconnects more likely. You can see how we initialize ethercrab here: https://github.com/qitechgmbh/control/blob/57f6e50e0e7eadd43c8cf39e855e30fa72a98108/server/src/ethercat/setup.rs#L43

We caught two disconnects with Wireshark: https://mega.nz/file/DR0gXCLD#fnOkjBQ1BEN6_8Zq5hflbtXEsaYu-vvQ3rzxVIJerp4 The share also contains our log files. We assume the most important lines are:

Aug 05 14:11:24 nixos server[995]: Failed to receive frame: pdu: failed to decode raw PDU data into type
Aug 05 14:11:24 nixos server[995]: Failed to receive frame: pdu: failed to decode raw PDU data into type
Aug 05 14:11:25 nixos server[995]: Memory: 2_986 KB (-9 KB/s)
Aug 05 14:11:25 nixos server[995]: Memory: 2_986 KB (-9 KB/s)
Aug 05 14:11:29 nixos server[995]: Loop failed
                                   timeout
                                   Stack backtrace:
                                      0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
                                      1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
                                      2: server::loop::init_loop::{{closure}}::{{closure}}
                                      3: <futures_lite::future::Or<F1,F2> as core::future::future::Future>::poll
                                      4: std::thread::local::LocalKey<T>::with
                                      5: async_io::driver::block_on
                                      6: server::loop::init_loop::{{closure}}
                                      7: std::sys::backtrace::__rust_begin_short_backtrace
                                      8: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
                                      9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                                     10: core::ops::function::FnOnce::call_once{{vtable.shim}}
                                     11: std::sys::pal::unix::thread::Thread::new::thread_start
                                     12: start_thread
                                     13: clone
Aug 05 14:11:29 nixos systemd[1]: qitech-control-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 14:11:29 nixos systemd[1]: qitech-control-server.service: Failed with result 'exit-code'.

and

Aug 05 15:02:06 nixos server[999]: Failed to claim receiving frame 13: expected state Sent, but got RxDone
Aug 05 15:02:06 nixos server[999]: Failed to receive frame: pdu: invalid PDU index 13
Aug 05 15:02:06 nixos server[999]: Failed to claim receiving frame 13: expected state Sent, but got RxDone
Aug 05 15:02:06 nixos server[999]: Failed to receive frame: pdu: invalid PDU index 13
Aug 05 15:02:11 nixos server[999]: Loop failed
                                   timeout
                                   Stack backtrace:
                                      0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
                                      1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
                                      2: server::loop::init_loop::{{closure}}::{{closure}}
                                      3: <futures_lite::future::Or<F1,F2> as core::future::future::Future>::poll
                                      4: std::thread::local::LocalKey<T>::with
                                      5: async_io::driver::block_on
                                      6: server::loop::init_loop::{{closure}}
                                      7: std::sys::backtrace::__rust_begin_short_backtrace
                                      8: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
                                      9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                                     10: core::ops::function::FnOnce::call_once{{vtable.shim}}
                                     11: std::sys::pal::unix::thread::Thread::new::thread_start
                                     12: start_thread
                                     13: clone
Aug 05 15:02:11 nixos systemd[1]: qitech-control-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 15:02:11 nixos systemd[1]: qitech-control-server.service: Failed with result 'exit-code'.

Environment

  • Rust version: 1.86
  • ethercrab version: 0.6

Oshgnacknak avatar Aug 07 '25 14:08 Oshgnacknak

We still have this problem, unfortunately. Increasing the PDU timeouts was sadly not a reliable fix. We are also running ethercrab with real-time threads on Linux and still hit this problem every couple of hours.

As a minimal fix, it would be great if the error failed the tx_rx call rather than the whole tx_rx_task, so the frame could be skipped or retried and the overall network setup would not be disrupted.

Here is what GPT-5 thinks about this issue, so take it with a grain of salt 🧂.

Problem Statement

The EtherCrab library experiences intermittent failures with the error:

Failed to claim receiving frame X: expected state `Sent`, but got `None` (or `RxProcessing`, or `Sendable`)

This error occurs unpredictably during normal operation, sometimes after seconds, sometimes after minutes, causing the TX/RX task to fail and disrupting EtherCAT communication.

Issue Summary

Affected Component: PDU (Protocol Data Unit) frame state management system
Symptom: Race condition in frame state transitions
Impact: Intermittent communication failures, system timeouts, unreliable operation
Frequency: Unpredictable - occurs under network stress or timing pressure
Environment: Affects both Windows and Linux, more pronounced on Windows due to timer resolution issues

Error Manifestations

  1. Failed to claim receiving frame X: expected state Sent, but got None
  2. Failed to claim receiving frame X: expected state Sent, but got RxProcessing
  3. Failed to claim receiving frame X: expected state Sent, but got Sendable
  4. Subsequent TX/RX: Timeout errors
  5. Complete communication breakdown requiring restart

Technical Analysis

Root Cause Analysis

The issue stems from concurrent access to frame state without proper synchronization. Here's the detailed breakdown:

1. Frame State Machine Vulnerability

// Current problematic pattern (simplified)
enum FrameState {
    None,
    Sendable, 
    Sent,
    RxProcessing,
}

// Non-atomic state transitions create race conditions
if frame.state == FrameState::Sent {
    frame.state = FrameState::RxProcessing;  // ⚠️ Race condition here
    // Process frame...
}

2. Timing-Related Race Conditions

  • Network delays: Frames arrive out of order or after timeouts
  • Thread scheduling: State checks and updates are not atomic
  • Timer resolution: Windows ~15ms timer granularity exacerbates timing issues
  • Concurrent access: Multiple code paths accessing frame state simultaneously

3. Specific Failure Scenarios

Scenario A: Timeout Race

Thread 1: Checks frame.state == Sent ✓
Thread 2: Timeout handler sets frame.state = None
Thread 1: Attempts to set frame.state = RxProcessing (but state is now None)
Result: "expected state Sent, but got None"

Scenario B: Double Processing

Network: Frame arrives
Thread 1: Claims frame (Sent → RxProcessing)
Network: Duplicate/delayed frame arrives  
Thread 2: Attempts to claim same frame
Result: "expected state Sent, but got RxProcessing"

Scenario C: Premature State Transition

Thread 1: Marks frame as Sendable
Thread 2: Network interface tries to claim before marking as Sent
Result: "expected state Sent, but got Sendable"

Proposed Fixing Approaches

Approach 1: Atomic State Management (Recommended)

Complexity: Medium | Reliability: High | Performance Impact: Minimal

Replace current state management with atomic operations:

use std::sync::atomic::{AtomicU8, Ordering};

pub struct FrameElement {
    state: AtomicU8,
}

impl FrameElement {
    fn try_claim_for_receive(&self) -> Result<(), Error> {
        // Use `compare_exchange` rather than `compare_exchange_weak` here: the
        // weak variant may fail spuriously, which would be misreported as a
        // state mismatch instead of being retried.
        match self.state.compare_exchange(
            FrameState::Sent as u8,
            FrameState::RxProcessing as u8,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => Ok(()),
            Err(actual) => Err(Error::InvalidFrameState {
                expected: FrameState::Sent,
                actual: FrameState::from(actual),
            }),
        }
    }
}

Pros:

  • Thread-safe state transitions
  • Eliminates race conditions
  • Minimal performance overhead
  • Maintains existing API structure

Cons:

  • Requires changes to core frame management
  • Need to audit all state access points

Approach 2: Mutex-Based Synchronization

Complexity: High | Reliability: High | Performance Impact: Moderate

Protect frame state with mutexes for complete synchronization.

Pros: Complete synchronization, easy to implement correctly
Cons: Higher performance overhead, potential for deadlocks

Approach 3: State Machine Redesign

Complexity: High | Reliability: High | Performance Impact: Minimal

Redesign the state machine with explicit transitions and better error handling.

Pros: More robust, easier to extend, built-in transition tracking
Cons: Significant refactoring required, API changes needed

Approach 4: Retry-Based Recovery

Complexity: Low | Reliability: Medium | Performance Impact: Low

Add retry logic when state mismatches occur.

Pros: Minimal code changes, handles transient issues
Cons: Doesn't fix root cause, may mask underlying problems

Approach 5: Event-Driven State Management

Complexity: Very High | Reliability: High | Performance Impact: Low

Implement event-driven state management with serialized state changes.

Pros: Complete serialization, easy debugging, no race conditions
Cons: Major architectural change, potential bottleneck

Recommendation

Primary Recommendation: Approach 1 (Atomic State Management)

This approach provides the best balance of reliability, performance, and maintainability while preserving the existing architecture.

Implementation Priority:

  1. Implement atomic state transitions for frame claiming
  2. Add comprehensive error handling and logging
  3. Improve timeout handling and recovery
  4. Add configuration options for timeout values
  5. Enhance testing with race condition scenarios

Secondary Recommendation: Approach 3 (State Machine Redesign)

For long-term reliability, consider the state machine redesign as it provides more robust error handling and better debugging capabilities.

Implementation Considerations

Testing Strategy

  • Unit tests: Test all state transitions under concurrent conditions
  • Integration tests: Test with network delays and packet loss
  • Stress tests: High-frequency frame operations
  • Race condition tests: Use tools like loom for Rust concurrency testing
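
As an illustrative sketch only (using stand-in state values, not EtherCrab's actual internals), a loom model checking that the compare-exchange claim from Approach 1 is exclusive could look like:

// Requires `loom` as a dev-dependency; run with `cargo test`.
use loom::sync::atomic::{AtomicU8, Ordering};
use loom::sync::Arc;
use loom::thread;

// Stand-in state values for the sketch only.
const SENT: u8 = 2;
const RX_PROCESSING: u8 = 3;

#[test]
fn claim_is_exclusive() {
    loom::model(|| {
        let state = Arc::new(AtomicU8::new(SENT));

        // Two contenders race to claim the same frame slot.
        let handles: Vec<_> = (0..2)
            .map(|_| {
                let state = state.clone();
                thread::spawn(move || {
                    state
                        .compare_exchange(SENT, RX_PROCESSING, Ordering::AcqRel, Ordering::Acquire)
                        .is_ok()
                })
            })
            .collect();

        let wins = handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .filter(|claimed| *claimed)
            .count();

        // Under every interleaving loom explores, exactly one claim succeeds.
        assert_eq!(wins, 1);
    });
}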

Monitoring and Debugging

  • Add detailed logging for all state transitions
  • Include frame timing information
  • Implement metrics collection for failure rates
  • Add debug modes for detailed tracing

Performance Considerations

  • Benchmark atomic operations vs current implementation
  • Monitor for any latency increases
  • Test under high-load scenarios
  • Validate real-time performance requirements

Conclusion

The frame state management race condition is a critical reliability issue that requires immediate attention. The atomic state management approach provides the most practical solution with minimal risk and maximum benefit. Implementation should be prioritized to ensure EtherCAT communication reliability in production environments.


This document provides a complete analysis that another developer can use to understand the issue thoroughly and implement an appropriate fix based on their specific requirements and constraints.

1cedsoda avatar Aug 08 '25 12:08 1cedsoda

@Oshgnacknak

Both crash1/err.pcapng and crash2/crash.pcapng show semi-regular RTT spikes of up to 1ms. Your cycle time is also 1ms, which leaves no time left for any processing.

[Image: Wireshark capture showing RTT spikes]

Have you configured your NIC with ethtool? I can't remember how, but you can definitely do it with Nix. tx-usecs and rx-usecs should be 0. You can check the current values with ethtool -c <interface>.

The cycle-cycle delta is also very odd. It flips between ~800us and ~100us:

[Image: Wireshark capture showing cycle-cycle delta]

This value should be consistent and very close to your configured cycle time. I think LoopThrottle might not be doing what you want it to do. Try increasing the cycle time to 2ms and see if anything gets better. I don't think you need 1ms but I'll leave that up to you.

This could possibly be the reason that packet timeouts occur - you might be sending stuff too fast for the SubDevices to respond. Beckhoff specifies 125us minimum cycle time for some/all their devices and a 100us cycle-cycle delta is under that.

You might want to switch to the blocking tx_rx_io_uring for a bit more performance.

I'd also recommend enabling Distributed Clocks and using the EtherCAT system time on the EK1100 as the cycle reference. You're still somewhat at the mercy of Linux timers but using something like timerfd can give good results. DC is a lot more involved to set up though.

jamwaffles avatar Aug 11 '25 00:08 jamwaffles

After 36h of no further crashes, we assume that both ethtool and io-uring together solved the issue. Thanks again for the time and effort.

Oshgnacknak avatar Aug 14 '25 12:08 Oshgnacknak

Great! Did you get to the bottom of the strange cycle-cycle delta timings too? I can see that causing you issues in the future.

jamwaffles avatar Aug 14 '25 17:08 jamwaffles

Hello,

I'm encountering a similar issue.

ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 14: expected state Sent, but got Sendable.

On Windows, when I paste text (multiline text) into the VS Code integrated terminal, it somehow disrupts EtherCrab’s operation and the EtherCAT connection is lost. Once this happens, the connection cannot be recovered.

I tried setting retry_behaviour: RetryBehaviour::Forever, but it didn’t help restore the connection.

Any ideas on what might be causing this or how to make the client more robust against terminal paste events?

Thank you very much!

sclee-aether avatar Dec 01 '25 09:12 sclee-aether

@sclee-aether Please open a new issue with a Wireshark capture of when you encounter the error. I'd recommend against using Windows if that's possible for your use case.

jamwaffles avatar Dec 01 '25 16:12 jamwaffles