Failed to claim receiving frame 13: expected state `Sent`, but got `None` (or `RxProcessing`, or `Sendable`)
Hi, when I run the ek1100 example, the tx_rx_task fails after a while. Sometimes after a couple of seconds, sometimes after a few minutes.
The example reproduces the issue, but I have also seen this and similar errors in my own codebase.
These errors were not introduced in 0.5.4 or 0.5.5; I have seen them before those releases.
Environment
- Version of ethercrab in use or git hash: 7fbd959c62c31b179d2c54259aee3071d761f162
- Operating system kind and version: OSX 15.3.1
- EtherCAT devices in use: EK1100, EL2002, EL3001, EL2521, EL2522
Uploads
Log using ek1100 example:
RUST_LOG=debug cargo run --example ek1100 en6
got None
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1000 EK1100 inputs: 0 bytes, outputs: 0 bytes
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1001 EL2002 inputs: 0 bytes, outputs: 1 bytes
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1002 EL3001 inputs: 2 bytes, outputs: 0 bytes
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1003 EL2521-0024 inputs: 8 bytes, outputs: 14 bytes
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1004 EL2522 inputs: 16 bytes, outputs: 28 bytes
[2025-03-21T14:18:17Z INFO ek1100] -> SubDevice 0x1005 EL2522 inputs: 16 bytes, outputs: 28 bytes
thread 'main' panicked at examples/ek1100.rs:122:40:
TX/RX: Timeout
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2025-03-21T14:19:56Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 4: expected state Sent, but got None
[2025-03-21T14:19:56Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 4
Similar errors I encountered with my codebase:
got RxProcessing
[2025-03-21T12:09:45Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 13: expected state Sent, but got RxProcessing
[2025-03-21T12:09:45Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 13
got Sendable
[2025-03-21T12:16:18Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 10: expected state Sent, but got Sendable
[2025-03-21T12:16:18Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 10
And this one
[2025-03-21T12:11:43Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: failed to decode raw PDU data into type
Hey, thanks for the report. For spurious issues like this, please always upload a Wireshark trace if you can capture the failing event. Having the packet logs makes debugging a bit easier on my side :).
Are you able to test the same example on a Linux machine, and if so do you see the same result? I haven't properly tested the macOS network driver in quite some time, so there could be a bug buried in it somewhere.
I ran it on Linux, waited 20 minutes, and everything seemed fine. I'm glad it's not that severe.
I reproduced it on Mac with a Wireshark capture. I can't see anything off with the process data just by looking at the frames of the last second.
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1000 EK1100 inputs: 0 bytes, outputs: 0 bytes
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1001 EL2002 inputs: 0 bytes, outputs: 1 bytes
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1002 EL3001 inputs: 4 bytes, outputs: 0 bytes
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1003 EL2521-0024 inputs: 4 bytes, outputs: 4 bytes
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1004 EL2522 inputs: 16 bytes, outputs: 28 bytes
[2025-03-21T17:13:35Z INFO ek1100] -> SubDevice 0x1005 EL2522 inputs: 16 bytes, outputs: 28 bytes
thread 'main' panicked at examples/ek1100.rs:122:40:
TX/RX: Timeout
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2025-03-21T17:21:50Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 4: expected state Sent, but got None
[2025-03-21T17:21:50Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 4
Just had a quick look and I see a 57ms hang between the penultimate received and the final sent frame. Not sure if the "failed to claim receiving frame" is a cause or a symptom of that but I'll look a bit deeper tomorrow maybe.
In the meantime, you could try bumping this config value and see where you get, although it kinda smells like there's some weirdness going on in macOS.
By the way, you can add a column called "delta time" in Wireshark that will show network round trip times and hangs between loop iterations. I can't remember if I mentioned this before, but if you're using tokio, I'd recommend switching to the smol executor (and timers, etc) instead as tokio can have issues with latency spikes.
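If it helps, here is a minimal sketch of driving the TX/RX task with smol on a dedicated thread. It assumes ethercrab's `ethercrab::std::tx_rx_task` helper and the `PduTx`/`PduRx` halves obtained from a `PduStorage` split as in the examples; the interface name is a placeholder.

```rust
use ethercrab::{std::tx_rx_task, PduRx, PduTx};

// Sketch only: `tx` and `rx` come from `PduStorage::try_split()` as in the ek1100 example.
fn spawn_tx_rx(tx: PduTx<'static>, rx: PduRx<'static>) {
    std::thread::spawn(move || {
        // Drive the TX/RX future with smol instead of tokio::spawn.
        smol::block_on(tx_rx_task("en6", tx, rx).expect("spawn TX/RX task"))
    });
}
```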
Had a similar issue. I ended up just blindly increasing the timeouts in MainDeviceConfig. Even then, it still takes a couple of tries running the script before it starts detecting the subdevices and running. Sometimes it works on the first run, sometimes it takes 5 retries. Will try getting a Wireshark trace later.
Another trace would be useful, thanks!
You could also play around with the retry_behaviour config option. Setting it to e.g. Count(5) could mitigate spurious errors. This is still a bandaid, but it's at least another one to have in the medkit :)
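For reference, a sketch of what that could look like together with a bumped PDU timeout, using ethercrab's `Timeouts`, `MainDeviceConfig` and `RetryBehaviour` types as I understand them; the duration and retry count are illustrative only.

```rust
use std::time::Duration;

use ethercrab::{MainDevice, MainDeviceConfig, PduLoop, RetryBehaviour, Timeouts};

fn make_maindevice(pdu_loop: PduLoop<'static>) -> MainDevice<'static> {
    MainDevice::new(
        pdu_loop,
        Timeouts {
            // Give SubDevices more time to answer before a frame is considered timed out.
            pdu: Duration::from_millis(100),
            ..Timeouts::default()
        },
        MainDeviceConfig {
            // Resend a timed-out frame a few times before surfacing an error.
            retry_behaviour: RetryBehaviour::Count(5),
            ..MainDeviceConfig::default()
        },
    )
}
```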
I can also reproduce this occasionally on an M1 MacBook Air using a USB-C ethernet adapter, running with cargo run --example ek1100 en11 (NOT with --release). I'm checked out to tag ethercrab-v0.5.6.
The root cause will be a race condition in the unsafe (🤫) internals of EtherCrab which will be fun to track down...
I noticed that when my Mac (plugged in) went to screensaver, the failed to claim error occurred and the example program stopped. This led me to a method to somewhat reliably reproduce the failure:
- Change the tick rate in the ek1100 example to 1ms
- `cargo run --example ek1100 en11` (optionally with `--release`)
- Type `screensaver` into Spotlight and open its settings window
- The EK1100 example will sometimes crash with `Failed to claim receiving frame ...`, sometimes with a non-logging timeout panic, or sometimes keep working fine
All this said, the error itself is a bit of a red herring. When a timeout occurs, some internal state is cleaned up by custom Drop impls, so when a response frame is received and the TX/RX task is still running it will try to fetch the now-invalid slot that frame belongs to, causing the error.
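To make that concrete, here is a tiny illustrative model of the race (not EtherCrab's actual code; the state names only mirror the ones in the error messages):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Illustrative frame states mirroring the ones in the error messages.
const NONE: u8 = 0;
const SENT: u8 = 1;
const RX_PROCESSING: u8 = 2;

struct Slot {
    state: AtomicU8,
}

impl Slot {
    // Receive path: only valid if the frame is still marked `Sent`.
    fn claim_receiving(&self) -> Result<(), u8> {
        self.state
            .compare_exchange(SENT, RX_PROCESSING, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
    }

    // Timeout path (e.g. a Drop impl cleaning up a timed-out future): slot goes back to `None`.
    fn release_on_timeout(&self) {
        self.state.store(NONE, Ordering::Release);
    }
}

fn main() {
    let slot = Slot { state: AtomicU8::new(SENT) };

    // If the timeout cleanup runs first, a late response frame can no longer claim the slot...
    slot.release_on_timeout();

    // ...and the receive path reports the equivalent of
    // "expected state `Sent`, but got `None`".
    assert_eq!(slot.claim_receiving(), Err(NONE));
}
```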
@1cedsoda For the other errors you posted under Similar errors I encountered with my codebase:, do they all occur after the timeout panic, or before it?
I have a few things to try:
- Always run with `--release` - this mitigates the crashiness a bit
- Consider increasing cycle times when running on macOS, using `#[cfg()]` attributes (see the sketch after this list)
- Do the same with `Timeouts::pdu`
- Turn on TX retries with `MainDeviceConfig::retry_behaviour`
- Use the `smol` executor instead of `tokio`. This is general advice for anyone seeking better EtherCrab performance really. `tokio` isn't as consistent as `smol` and seems to suffer when system load is higher.
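As a concrete example of the cycle-time point above, the platform-specific constant could look like this (the durations are placeholders, not recommendations):

```rust
use core::time::Duration;

// Use a more relaxed cycle time on macOS, where timer behaviour seems less predictable.
#[cfg(target_os = "macos")]
const CYCLE_TIME: Duration = Duration::from_millis(5);

#[cfg(not(target_os = "macos"))]
const CYCLE_TIME: Duration = Duration::from_millis(1);
```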
If the errors are always after a timeout panic, I'm afraid there's no definitive bug to fix or action to take, so I'll probably close this issue. Let me know how you get on.
Can't see any timeout panic here
[2025-04-03T13:48:00Z INFO server::ethercat::r#loop] Starting control loop
[2025-04-03T13:50:13Z ERROR ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 10: expected state Sent, but got Sendable
[2025-04-03T13:50:13Z ERROR ethercrab::std::unix] Failed to receive frame: pdu: invalid PDU index 10
I will try the things you proposed.
I will transition to smol but have to find out how to mix it properly with tokio, which I need for HTTP and SocketIO.
Hm. Strange. Are you on v0.6? That might have something magical in it that fixes everything.
As usual, please attach Wireshark captures, full logs and source code if it's easy enough to capture these failures. I can't do much without the full picture. At minimum, the Wireshark capture will tell me if everything is ok on the network side.
Are you sending a lot of traffic with a fast cycle time? If you are, consider increasing the MAX_FRAMES const generic, i.e. this one.
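For reference, MAX_FRAMES is the first const generic parameter of the PDU storage as I understand it, so bumping it looks something like this (values are illustrative):

```rust
use ethercrab::PduStorage;

// Illustrative values only: allow more PDU frames to be in flight at once.
const MAX_FRAMES: usize = 32;
const MAX_PDU_DATA: usize = PduStorage::element_size(1100);

static PDU_STORAGE: PduStorage<MAX_FRAMES, MAX_PDU_DATA> = PduStorage::new();
```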
Quick update: the problem seems to have disappeared with 0.6.0 (at least for me), although this time I ran the ek1100 example on its own (not from a cloned repo).
@1cedsoda I'll close this as resolved unless you're still experiencing the issue. If so, please reopen.
Hi, @1cedsoda, a few others from our group and I are still struggling with disconnects on our machines. One machine contains roughly 10 subdevices, all connected via two EK1100s wired in series.
EtherCAT connections are managed by ethercrab, now running on a real-time thread as suggested. This helped for a few weeks; however, disconnects keep reappearing. We assume that having more devices makes disconnects more likely. You can see how we initialize ethercrab here: https://github.com/qitechgmbh/control/blob/57f6e50e0e7eadd43c8cf39e855e30fa72a98108/server/src/ethercat/setup.rs#L43
We caught two disconnects with Wireshark: https://mega.nz/file/DR0gXCLD#fnOkjBQ1BEN6_8Zq5hflbtXEsaYu-vvQ3rzxVIJerp4 In this share, you can also see our log files. We assume the most important lines are
Aug 05 14:11:24 nixos server[995]: Failed to receive frame: pdu: failed to decode raw PDU data into type
Aug 05 14:11:24 nixos server[995]: Failed to receive frame: pdu: failed to decode raw PDU data into type
Aug 05 14:11:25 nixos server[995]: Memory: 2_986 KB (-9 KB/s)
Aug 05 14:11:25 nixos server[995]: Memory: 2_986 KB (-9 KB/s)
Aug 05 14:11:29 nixos server[995]: Loop failed
timeout
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
2: server::loop::init_loop::{{closure}}::{{closure}}
3: <futures_lite::future::Or<F1,F2> as core::future::future::Future>::poll
4: std::thread::local::LocalKey<T>::with
5: async_io::driver::block_on
6: server::loop::init_loop::{{closure}}
7: std::sys::backtrace::__rust_begin_short_backtrace
8: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
10: core::ops::function::FnOnce::call_once{{vtable.shim}}
11: std::sys::pal::unix::thread::Thread::new::thread_start
12: start_thread
13: clone
Aug 05 14:11:29 nixos systemd[1]: qitech-control-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 14:11:29 nixos systemd[1]: qitech-control-server.service: Failed with result 'exit-code'.
and
Aug 05 15:02:06 nixos server[999]: Failed to claim receiving frame 13: expected state Sent, but got RxDone
Aug 05 15:02:06 nixos server[999]: Failed to receive frame: pdu: invalid PDU index 13
Aug 05 15:02:06 nixos server[999]: Failed to claim receiving frame 13: expected state Sent, but got RxDone
Aug 05 15:02:06 nixos server[999]: Failed to receive frame: pdu: invalid PDU index 13
Aug 05 15:02:11 nixos server[999]: Loop failed
timeout
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
2: server::loop::init_loop::{{closure}}::{{closure}}
3: <futures_lite::future::Or<F1,F2> as core::future::future::Future>::poll
4: std::thread::local::LocalKey<T>::with
5: async_io::driver::block_on
6: server::loop::init_loop::{{closure}}
7: std::sys::backtrace::__rust_begin_short_backtrace
8: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
10: core::ops::function::FnOnce::call_once{{vtable.shim}}
11: std::sys::pal::unix::thread::Thread::new::thread_start
12: start_thread
13: clone
Aug 05 15:02:11 nixos systemd[1]: qitech-control-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 15:02:11 nixos systemd[1]: qitech-control-server.service: Failed with result 'exit-code'.
Environment
- Rust version: 1.86
- ethercrab version: 0.6
We still have this problem, unfortunately. Increasing PDU timeouts was sadly not a reliable fix. We are also running ethercrab with realtime threads on Linux and still get this problem every couple of hours.
As a minimal fix, it would be great if the error failed the tx_rx function rather than the tx_rx_task, so the frame could be skipped/retried and the general network setup would not be disrupted.
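To illustrate the desired behaviour from the caller's side, here is a minimal sketch of a cyclic loop that tolerates occasional cycle failures instead of giving up immediately. The closure stands in for `group.tx_rx(&maindevice)` plus process-data handling, and the give-up threshold is arbitrary.

```rust
use std::future::Future;

/// Sketch only: run one bus cycle per iteration, tolerating occasional failures.
async fn cyclic_loop<F, Fut, E>(mut cycle: F)
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
    E: core::fmt::Display,
{
    let mut consecutive_errors = 0usize;

    loop {
        match cycle().await {
            Ok(()) => consecutive_errors = 0,
            Err(e) => {
                consecutive_errors += 1;
                eprintln!("Cycle failed ({consecutive_errors} in a row): {e}");

                // Arbitrary threshold: only give up if the bus looks genuinely gone.
                if consecutive_errors > 10 {
                    break;
                }
            }
        }

        // Cycle timing (timer / LoopThrottle equivalent) goes here.
    }
}
```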
Here is what GPT-5 thinks about this issue so take it with a grain of salt 🧂.
Problem Statement
The EtherCrab library experiences intermittent failures with the error:
Failed to claim receiving frame X: expected state `Sent`, but got `None`
(or `RxProcessing`, or `Sendable`)
This error occurs unpredictably during normal operation, sometimes after seconds, sometimes after minutes, causing the TX/RX task to fail and disrupting EtherCAT communication.
Issue Summary
Affected Component: PDU (Protocol Data Unit) frame state management system
Symptom: Race condition in frame state transitions
Impact: Intermittent communication failures, system timeouts, unreliable operation
Frequency: Unpredictable - occurs under network stress or timing pressure
Environment: Affects both Windows and Linux, more pronounced on Windows due to timer resolution issues
Error Manifestations
- `Failed to claim receiving frame X: expected state Sent, but got None`
- `Failed to claim receiving frame X: expected state Sent, but got RxProcessing`
- `Failed to claim receiving frame X: expected state Sent, but got Sendable`
- Subsequent `TX/RX: Timeout` errors
- Complete communication breakdown requiring restart
Technical Analysis
Root Cause Analysis
The issue stems from concurrent access to frame state without proper synchronization. Here's the detailed breakdown:
1. Frame State Machine Vulnerability
// Current problematic pattern (simplified)
enum FrameState {
    None,
    Sendable,
    Sent,
    RxProcessing,
}

// Non-atomic state transitions create race conditions
if frame.state == FrameState::Sent {
    frame.state = FrameState::RxProcessing; // ⚠️ Race condition here
    // Process frame...
}
2. Timing-Related Race Conditions
- Network delays: Frames arrive out of order or after timeouts
- Thread scheduling: State checks and updates are not atomic
- Timer resolution: Windows ~15ms timer granularity exacerbates timing issues
- Concurrent access: Multiple code paths accessing frame state simultaneously
3. Specific Failure Scenarios
Scenario A: Timeout Race
Thread 1: Checks frame.state == Sent ✓
Thread 2: Timeout handler sets frame.state = None
Thread 1: Attempts to set frame.state = RxProcessing (but state is now None)
Result: "expected state Sent, but got None"
Scenario B: Double Processing
Network: Frame arrives
Thread 1: Claims frame (Sent → RxProcessing)
Network: Duplicate/delayed frame arrives
Thread 2: Attempts to claim same frame
Result: "expected state Sent, but got RxProcessing"
Scenario C: Premature State Transition
Thread 1: Marks frame as Sendable
Thread 2: Network interface tries to claim before marking as Sent
Result: "expected state Sent, but got Sendable"
Proposed Fixing Approaches
Approach 1: Atomic State Management (Recommended)
Complexity: Medium | Reliability: High | Performance Impact: Minimal
Replace current state management with atomic operations:
use std::sync::atomic::{AtomicU8, Ordering};

pub struct FrameElement {
    state: AtomicU8,
}

impl FrameElement {
    fn try_claim_for_receive(&self) -> Result<(), Error> {
        // Use the strong `compare_exchange`: the weak variant may fail spuriously
        // when it is not retried in a loop.
        match self.state.compare_exchange(
            FrameState::Sent as u8,
            FrameState::RxProcessing as u8,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => Ok(()),
            Err(actual) => Err(Error::InvalidFrameState {
                expected: FrameState::Sent,
                actual: FrameState::from(actual),
            }),
        }
    }
}
Pros:
- Thread-safe state transitions
- Eliminates race conditions
- Minimal performance overhead
- Maintains existing API structure
Cons:
- Requires changes to core frame management
- Need to audit all state access points
Approach 2: Mutex-Based Synchronization
Complexity: High | Reliability: High | Performance Impact: Moderate
Protect frame state with mutexes for complete synchronization.
Pros: Complete synchronization, easy to implement correctly
Cons: Higher performance overhead, potential for deadlocks
Approach 3: State Machine Redesign
Complexity: High | Reliability: High | Performance Impact: Minimal
Redesign the state machine with explicit transitions and better error handling.
Pros: More robust, easier to extend, built-in transition tracking
Cons: Significant refactoring required, API changes needed
Approach 4: Retry-Based Recovery
Complexity: Low | Reliability: Medium | Performance Impact: Low
Add retry logic when state mismatches occur.
Pros: Minimal code changes, handles transient issues
Cons: Doesn't fix root cause, may mask underlying problems
Approach 5: Event-Driven State Management
Complexity: Very High | Reliability: High | Performance Impact: Low
Implement event-driven state management with serialized state changes.
Pros: Complete serialization, easy debugging, no race conditions
Cons: Major architectural change, potential bottleneck
Recommendation
Primary Recommendation: Approach 1 (Atomic State Management)
This approach provides the best balance of reliability, performance, and maintainability while preserving the existing architecture.
Implementation Priority:
- Implement atomic state transitions for frame claiming
- Add comprehensive error handling and logging
- Improve timeout handling and recovery
- Add configuration options for timeout values
- Enhance testing with race condition scenarios
Secondary Recommendation: Approach 3 (State Machine Redesign)
For long-term reliability, consider the state machine redesign as it provides more robust error handling and better debugging capabilities.
Implementation Considerations
Testing Strategy
- Unit tests: Test all state transitions under concurrent conditions
- Integration tests: Test with network delays and packet loss
- Stress tests: High-frequency frame operations
- Race condition tests: Use tools like `loom` for Rust concurrency testing (a minimal sketch follows below)
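A minimal sketch of what such a loom test could look like for the claim-vs-timeout interleaving; the state values and transitions are illustrative, not EtherCrab's actual ones.

```rust
use loom::sync::atomic::{AtomicU8, Ordering};
use loom::sync::Arc;
use loom::thread;

const NONE: u8 = 0;
const SENT: u8 = 1;
const RX_PROCESSING: u8 = 2;

#[test]
fn claim_vs_timeout() {
    // loom explores all interleavings of the two threads below.
    loom::model(|| {
        let state = Arc::new(AtomicU8::new(SENT));

        let rx = {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                // Receive path: claim the frame only if it is still `SENT`.
                let _ = state.compare_exchange(SENT, RX_PROCESSING, Ordering::AcqRel, Ordering::Acquire);
            })
        };

        // Timeout path: reclaim the slot for reuse.
        let _ = state.compare_exchange(SENT, NONE, Ordering::AcqRel, Ordering::Acquire);

        rx.join().unwrap();

        // Exactly one of the two transitions can have won the race.
        let end = state.load(Ordering::Acquire);
        assert!(end == NONE || end == RX_PROCESSING);
    });
}
```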
Monitoring and Debugging
- Add detailed logging for all state transitions
- Include frame timing information
- Implement metrics collection for failure rates
- Add debug modes for detailed tracing
Performance Considerations
- Benchmark atomic operations vs current implementation
- Monitor for any latency increases
- Test under high-load scenarios
- Validate real-time performance requirements
Conclusion
The frame state management race condition is a critical reliability issue that requires immediate attention. The atomic state management approach provides the most practical solution with minimal risk and maximum benefit. Implementation should be prioritized to ensure EtherCAT communication reliability in production environments.
This document provides a complete analysis that another developer can use to understand the issue thoroughly and implement an appropriate fix based on their specific requirements and constraints.
@Oshgnacknak
Both crash1/err.pcapng and crash2/crash.pcapng show semi-regular RTT spikes of up to 1ms. Your cycle time is also 1ms, which leaves no time for any processing.
Have you configured your NIC with ethtool? I can't remember how but you can definitely do it with Nix. tx-usecs and rx-usecs should be 0. You can check the current value with ethtool -c <interface>.
The cycle-cycle delta is also very odd. It flips between ~800us and ~100us.
This value should be consistent and very close to your configured cycle time. I think LoopThrottle might not be doing what you want it to do. Try increasing the cycle time to 2ms and see if anything gets better. I don't think you need 1ms but I'll leave that up to you.
This could possibly be the reason that packet timeouts occur - you might be sending stuff too fast for the SubDevices to respond. Beckhoff specifies 125us minimum cycle time for some/all their devices and a 100us cycle-cycle delta is under that.
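For the cycle-cycle delta point, here is a minimal fixed-deadline loop sketch using `async_io::Timer`; the 2ms value and the loop body are placeholders, and this is not the poster's LoopThrottle.

```rust
use std::time::{Duration, Instant};

use async_io::Timer;

const CYCLE_TIME: Duration = Duration::from_millis(2);

async fn cyclic_task() {
    let mut next = Instant::now() + CYCLE_TIME;

    loop {
        // group.tx_rx(&maindevice).await and process-data handling would go here.

        // Sleep until an absolute deadline so the cycle-cycle delta stays close to
        // CYCLE_TIME instead of drifting or oscillating.
        Timer::at(next).await;
        next += CYCLE_TIME;
    }
}
```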
You might want to switch to the blocking tx_rx_io_uring for a bit more performance.
I'd also recommend enabling Distributed Clocks and using the EtherCAT system time on the EK1100 as the cycle reference. You're still somewhat at the mercy of Linux timers but using something like timerfd can give good results. DC is a lot more involved to set up though.
After 36h of no further crashes, we assume that both ethtool and io-uring together solved the issue. Thanks again for the time and effort.
Great! Did you get to the bottom of the strange cycle-cycle delta timings too? I can see that causing you issues in the future.
Hello,
I'm encountering a similar issue.
ethercrab::pdu_loop::frame_element] Failed to claim receiving frame 14: expected state Sent, but got Sendable.
On Windows, when I paste text (multiline text) into the VS Code integrated terminal, it somehow disrupts EtherCrab’s operation and the EtherCAT connection is lost. Once this happens, the connection cannot be recovered.
I tried setting retry_behaviour: RetryBehaviour::Forever, but it didn’t help restore the connection.
Any ideas on what might be causing this or how to make the client more robust against terminal paste events?
Thank you very much!
@sclee-aether Please open a new issue with a Wireshark capture of when you encounter the error. I'd recommend against using Windows if that's possible for your use case.