machinekit
Design error: command serials may collide
Commands passed via NML typically carry an integer serial number for tracking status (received, being acted upon, completed, errored).
There is a fundamental design error: serial numbers are assigned on the originating side of a command. As soon as more than one command originator is active, this creates the danger of collisions, because NML has no originator identifier that could be used to make a submission ID unique (via a (submitter, serial) tuple).
An example for a hack around this issue is here: https://github.com/machinekit/machinekit/blob/master/src/emc/usr_intf/emcrsh.cc#L2842
Problem: specify and implement a serial scheme that guarantees uniqueness.
This is a real issue if you want to use HALUI together with a normal UI like AXIS (please refer to my original bug report to LinuxCNC: http://sourceforge.net/p/emc/bugs/328/)
The patch I submitted there is more of an intermediate hot fix to ensure atomic submission ID's. A real solution to this issue will require either a publish/subscribe pattern for result notification (if an asynchronous design is required) or blocking calls for command submission (in a synchronous design). In my opinion the first is more flexible and fits better into the current design. Michael has already done a lot of work in this direction with his ZMQ/Protobuf migration.
I see several ways to deal with the issue; I'm not totally happy with any of them:
(1) Any submitter first acquires a unique ID from an ID-issuing service. Guarantees global uniqueness and total ordering of ID's, with no holes between ID's, but there's significant overhead associated with each acquisition.
(2) There's a single entity accepting commands and that entity issues ID's in response to the request. That guarantees ordering and no holes between ID's. De facto the global serial is a local variable of that entity. Unfortunately the single command acceptor is a significant restriction. On second thought, I exclude that option.
(3) There's a service which hands out serial number ranges, e.g. in blocks of, say, 1000. Any command submitter acquires a block of ID's first, and acquires a new one whenever that block is depleted. Upside: total ordering, uniqueness, efficient. Downside: ID ranges might have holes (i.e. not be strictly sequential). Bonus upside: the block-issuing service can identify the source of a given ID if it keeps a {socket name, ID range} table.
I'm leaning towards (3); the natural place for ID management is the atomic ID-allocating code which already exists in rtapi_msgd (for HAL module ID's etc). That range would be unique per RT instance.
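To make option (3) concrete, here is a minimal sketch of block-wise serial allocation. All names (BlockIssuer, Submitter, acquire_block) and the block size are illustrative only, not actual rtapi_msgd API:

```python
# Sketch of option (3): a central service hands out serial-number blocks,
# and each submitter draws IDs from its current block, fetching a new one
# on depletion. Names and the block size are hypothetical.
import itertools
import threading

BLOCK_SIZE = 1000  # example block size from the discussion

class BlockIssuer:
    """Stands in for the atomic ID allocator (e.g. in rtapi_msgd)."""
    def __init__(self):
        self._next = itertools.count(0)
        self._lock = threading.Lock()
        self.owners = {}  # {block_start: submitter_name}, for source lookup

    def acquire_block(self, submitter):
        with self._lock:
            start = next(self._next) * BLOCK_SIZE
            self.owners[start] = submitter
            return start, start + BLOCK_SIZE

class Submitter:
    def __init__(self, issuer, name):
        self._issuer, self.name = issuer, name
        self._cur, self._end = issuer.acquire_block(name)

    def next_serial(self):
        if self._cur >= self._end:          # block depleted: fetch another
            self._cur, self._end = self._issuer.acquire_block(self.name)
        serial, self._cur = self._cur, self._cur + 1
        return serial
```

Serials stay globally unique and totally ordered, while a submitter only talks to the issuer once per 1000 IDs; the owners table gives the "identify the source of a given ID" bonus.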
The only questions warranting more consideration are:
is there any downside to having 'holes' between ID's? I don't think there is; the 'equals' and 'greater than' operators are required, but those are not affected by holes.
Restriction on ID's: they must fit within a HAL data type, i.e. for now 32bit ints. It should be straightforward to do 64bit scalars, but a UUID is out of scope (128 bits required).
Question: are 'Lamport timestamps' OK as ID's? http://en.wikipedia.org/wiki/Lamport_timestamps (in other words: must the ID's support determining the order of events per originator, but not globally)
Question: sortability of ID's: this certainly must hold for ID's of a single originator, but I do not see a requirement for sortability at the global level
the seminal Lamport paper: http://www.stanford.edu/class/cs240/readings/lamport.pdf
Here's an interesting presentation on the issue: http://de.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems , covering the theme from various angles
Also http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation discusses some options.
The 'Twitter snowflake' sounds interesting as a basis:

- Snowflake IDs are 64 bits, half the size of a UUID
- Can use time as first component and remain sortable
- Distributed system that can survive nodes dying
The Instagram ID's would probably work, also requiring 64 bits: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram Each ID consists of:

- 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
- 13 bits that represent the logical shard ID
- 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
Another option is simpleflake: http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/
A lamport clock, but UUID-size (128bits): https://github.com/jamesgolick/lexical_uuid.erl
On 4/17/2014 10:59 AM, Michael Haberler wrote:
The 'Twitter snowflake' sounds interesting as a basis; I would like to avoid time-synchronisation as requirement though.
Time synchronization is not required, unless you need to be able to (trivially) sort the ID's by generation time. Even without synchronized clocks, you can align the ID's if debugging by using the generator ID and a time offset for each generator.
The basic idea seems to be that each ID is composed of:

- A guaranteed unique generator ID (which could perhaps be assigned as part of the subscribe/publish protocol?)
- Enough bits created and assigned locally by the generator to ensure no two local IDs are ever the same. Optionally, this value can include a timestamp and be constructed such that it is easy to order the events by creation time.
Charles Steinkuehler [email protected]
yes, that's the idea. And time synchronization is a non-requirement, my bad. And global time is not a useful concept for event ordering in such a system anyway.
The generator ID will need to include the HAL module ID's because those are originators/consumers too, not just remote entities like UI's. So something like a remote getNextModuleId() service covers that, and ID's would be unambiguously tagged as originating from a certain module (which might even help debugging).
As for the type of system (see http://en.wikipedia.org/wiki/Lamport_timestamps#Lamport.27s_logical_clock_in_distributed_systems) I think what we have at hand is a system with partial order between processes, suggesting this concept is the right fit
Actually I see ID's based on Lamport clocks as having upside: right now serials are int's which are happily compared with no safeguards as to origin; but comparing timestamps from different originators makes no sense by definition. But defining happened-before and equality operators on such a datatype would be able to raise an exception if timestamps of different origin are compared, which is a conceptual error. And we get a timestamp for free.
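A minimal sketch of such an origin-checked ID type; the names (Stamp, origin, clock) are illustrative only, not a proposed API:

```python
# Sketch of the ID type proposed above: ordering comparisons raise when
# stamps from different originators are compared (a conceptual error),
# while equal-origin stamps order by their local Lamport counter.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stamp:
    origin: int      # generator/module ID
    clock: int       # local Lamport counter (interpretable as a timestamp)

    def _check(self, other):
        if self.origin != other.origin:
            raise ValueError("cannot order stamps from different originators")

    def __lt__(self, other):   # happened-before within one originator
        self._check(other)
        return self.clock < other.clock
```

Equality still works across origins (stamps with different origins simply compare unequal); only the ordering operators enforce a common originator.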
I guess the next step is to do some back-of-envelope calculations to derive sensible boundaries for timestamp granularity, serial and module ID which cover the execution model for the foreseeable future. I'll make it a macro and #defines just in case.
let's start with a modified Instagram scheme. Assume a 64-bit total ID size.
I'll swap the sizes of the sequence number and 'shard ID' fields since we have different size and accuracy requirements. That leaves 10 bits, or a maximum of 1024, for the module ID.
Each of our IDs consists of:
- 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
- 13 bits that represent an auto-incrementing sequence, modulus 8192.
- 10 bits that represent the generator ID, maxing out at 1024 generators.
This means we can generate 8192 IDs per module per millisecond - one every ~122ns, or a rate of about 8 million messages/second. This would be good enough for most scenarios except very tight loops (even ringbuffer operations take 100-300ns/message on my Mac), and that scenario is unlikely.
A solution could be: a generator is required to check uniqueness of an ID (this would only matter if it is called less than ~122ns apart, so it is unlikely and adds very little delay; it could even be done as a busy-loop in an RT thread).
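The packing just described is plain bit arithmetic; the helper names below are illustrative:

```python
# Bit-packing sketch of the modified scheme above: 41 bits of millisecond
# timestamp, 13 bits of sequence, 10 bits of generator ID in one 64-bit int.
# Field widths follow the discussion; pack_id/unpack_id are hypothetical names.
TIME_BITS, SEQ_BITS, GEN_BITS = 41, 13, 10

def pack_id(ms, seq, gen):
    assert ms < (1 << TIME_BITS) and seq < (1 << SEQ_BITS) and gen < (1 << GEN_BITS)
    return (ms << (SEQ_BITS + GEN_BITS)) | (seq << GEN_BITS) | gen

def unpack_id(flake):
    gen = flake & ((1 << GEN_BITS) - 1)
    seq = (flake >> GEN_BITS) & ((1 << SEQ_BITS) - 1)
    ms = flake >> (SEQ_BITS + GEN_BITS)
    return ms, seq, gen
```

Because the timestamp occupies the high bits, integer comparison of two flakes from the same generator orders them by creation time.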
I think it's safe to assume that an ensemble of more than 1024 components (HAL or otherwise) talking to each other is unlikely to happen.
This timestamp is absolute for the next 41 years. Note the references to Twitter, Instagram etc cover a requirement which we don't have: those ID's are for permanent storage in databases, unambiguously tagging an item with an absolute time reference. We don't have that requirement because ID's need not be persistent across sessions.
There's one variant which might warrant consideration: use timestamps relative to startup. If we assume a maximum instance lifetime of say 1 year, that would gain us 5 bits off the timestamp.
If we put these 5 bits to work in the serial (so the cut would be 36 bits of timestamp relative to startup, 18 bits of sequence, 10 bits of generator ID), that would mean we can generate 262 million ID's per second, or roughly a unique ID every 4ns.
Cost/benefit versus the first variant as I see it:
- decoding this timestamp into absolute time requires the startup epoch (but since generator ID needs to be managed per-instance there's no extra cost)
- uniqueness even in tight loops pretty much guaranteed (but not much of an upside)
- the assumed maximum instance lifetime could turn out to be a bummer if this is used for long-running processes like automation.
In fact it might not be necessary to make a hard decision on this anyway. Since we need to centrally manage the generatorID (which translates into a RPC at startup time to retrieve the next unique generator ID) we could just tack on the startup epoch, masks and shift counts as variables to this reply, making the whole decision a configuration time issue.
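The "make it a configuration time issue" idea might look like this; the FlakeCodec name and the notion of deriving shifts/masks from field widths carried in the RPC reply are assumptions for illustration:

```python
# Sketch of deferring the layout decision: the getGeneratorId() reply could
# carry the field widths (and startup epoch), and both sides derive their
# shift/mask constants from them. The class and both layouts are illustrative.
class FlakeCodec:
    def __init__(self, time_bits, seq_bits, gen_bits):
        assert time_bits + seq_bits + gen_bits == 64
        self.seq_shift = gen_bits
        self.time_shift = gen_bits + seq_bits
        self.gen_mask = (1 << gen_bits) - 1
        self.seq_mask = (1 << seq_bits) - 1

    def pack(self, t, seq, gen):
        return (t << self.time_shift) | (seq << self.seq_shift) | gen

    def unpack(self, flake):
        return (flake >> self.time_shift,
                (flake >> self.seq_shift) & self.seq_mask,
                flake & self.gen_mask)

# the two variants discussed above, selectable at configuration time:
absolute = FlakeCodec(41, 13, 10)   # absolute ms epoch
relative = FlakeCodec(36, 18, 10)   # relative-to-startup variant
```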
On 4/18/2014 3:18 AM, Michael Haberler wrote:
There's one variant which might warrant consideration: use timestamps relative to startup. If we assume a maximum instance lifetime of say 1 year, that would gain us 5 bits off the timestamp.
If we put these 5 bits to work in the serial (so the cut would be 36 bits of timestamp relative to startup, 18 bits of sequence, 10 bits of generator ID), that would mean we can generate 262 million ID's per second, or roughly a unique ID every 4ns.
IMHO, all of this is likely massive overkill. The systems I work with get 5-8 bits of unique transaction identifier to use. That is combined with a unique originator ID, but the result is less than 32-bits.
How long is it necessary for a transaction ID to exist? Typically, in high-speed systems like you're talking about (generating IDs on a nanosecond time-scale), the ID only needs to survive for as long as the particular transaction it's identifying.
PCI Express, for example, uses a 16-bit "generator ID" (8-bit bus ID + 5-bit device number + 3-bit function), combined with up to 8 bits of "tag" assigned by the generator. That still leaves 8 bits free in a 32-bit word... exactly how "unique" do these tags need to be?
Charles Steinkuehler [email protected]
just exploring the boundary cases, which includes overkill.
IMO it's not completely out of scope to consider a long-running application like some automation project with an uptime of years. I wouldn't want assumptions like maximum runtime == 1 year before ID rollover compiled in.
But that's not going to be a problem anyway if we go the route I laid out: generator params are distributed with the initial getGeneratorId() RPC, and those work from compiled-in defaults which could potentially be overridden by config options.
the upside of all this: we get strong type checking on serial compares (only ID's with the same generatorID are legitimate to compare), and timestamps for timing measurements for free (haha... need to write this code first though).
On 4/19/2014 4:04 AM, Michael Haberler wrote:
IMO it's not completely out of scope to consider a long-running application like some automation project with an uptime of years. I wouldn't want assumptions like maximum runtime == 1 year before ID rollover compiled in.
The question is do ID values have to be unique for the life of the application or just for the life of the request? IMHO regardless of how big your ID is, you will eventually exhaust the space and eventually create overlapping ID values after some (possibly quite vast) amount of time.
I see no reason ID values HAVE to be unique for longer than the existence of the transaction, which is presumably quite short. And once you can re-use ID values safely, the required number of bits shrinks dramatically.
...but I could be missing something, and it can be handy to have nice wall-clock timestamps in the IDs to make for easier debugging. I'm just trying to understand if it's really required.
I would also say it is bad design to craft a system where rollover and ID collision cause any sort of problem at all, so why not make the rollover period fairly small (say somewhere between a few seconds and an hour)?
Charles Steinkuehler [email protected]
Right. Need to exclude any chances of problems.
To distinguish one ID from another, the equality operator is sufficient. But equality alone doesn't give you the ordering of ID's of a given originator.
Assume for a moment you tap into several components and log ID's from a bird's-eye view. We know that global time doesn't make sense in a distributed system (clock skew and transmission delay lead to undecidability). So all you get is a bag of ID's. But if you don't have a happened-before operator on the ID's, you can't get at the temporal, and hence causal, ordering of events.
I think that is a fairly important property. It's laid out here: http://en.wikipedia.org/wiki/Happened-before
On 4/19/2014 6:55 AM, Michael Haberler wrote:
To distinguish one ID from another, the equality operator is sufficient. But equality alone doesn't give you the ordering of ID's of a given originator.
So the question becomes which IDs need to be compared? If it's just the active IDs for a single endpoint, the order they were queued in implicitly defines happened-before.
If we need to compare IDs in a larger time horizon that includes no-longer active IDs, the questions becomes how long a horizon and how is rollover handled.
If we need to compare IDs between multiple end-points, this gets messy real fast.
Charles Steinkuehler [email protected]
right, and time sortability gives us the queuing order of a single originator (endpoints might, and do receive commands from more than one originator; eg. motion with jog-while-paused: task/ui/interp is one originator, the jog-while-paused HAL driving entity another).
making the time axis a window which can be shifted by a 'reset' type operation (forget all ID's before the window start): that is an option. I think resetting time is potentially non-trivial: I guess one would need to broadcast to all possible generators, make sure they flush all queues, and then collect consent from all generators before proceeding.
re ID's from different endpoints: actually that was part of my post-PhD dabblings: http://goo.gl/in4tJP and http://goo.gl/6UsXNe (executive summary); we built a language (TSL - Task Sequencing Language) to express such relations in a kind of event algebra, basically as temporal assertions. The key to linking different generators is an interaction between the generators which asserts sequencing/causality (in the TSL case an Ada rendezvous). We even had partial compile-time detection for non-causal event sequences. You could tag statements with pseudo-comments and formulate an assertion like so:
begin A => B => C end
meaning to the observer (TSL runtime), events must appear in that order or the assertion was violated; the compiler checked that A, B and C either were from the same originator, or the originators were sequenced through a rendezvous (fuzzy, it's been a while). I promise not to make TSL part of machinekit, though (what a grandiose opportunity for self-citation :)
On 4/19/2014 10:08 AM, Michael Haberler wrote:
right, and time sortability gives us the queuing order of a single originator (endpoints might, and do receive commands from more than one originator; eg. motion with jog-while-paused: task/ui/interp is one originator, the jog-while-paused HAL driving entity another).
You're getting back to your synchronized clocks again. I thought it was decided timestamp comparisons were valid only for IDs from the same generator. If comparison across generators is required, either the problem gets much messier, or the consumer needs to timestamp the IDs when they are received.
Again, this gets back to the question of whether or not IDs need to be compared across generators and/or across consumers, and exactly what the IDs are being used for. If it's just to track messages, I don't see the need for all the complexity or ability to sort messages from the birth of the universe to entropy death with nS resolution. :)
Charles Steinkuehler [email protected]
synchronized clocks: that must be a misunderstanding. A global (=synchronized) clock is a useless concept in distributed systems with asynchronous message passing, so I'm not getting back to them - time equality in such a setup is undecidable (there is lots of research into decidability of this, and this is what Lamport says which is why he advocates local clocks). It may be a useful concept in isochronous systems like the old POTS network, and electronics, but not here.
so no, those are not synchronized - they are strictly local to a generator. It is exactly like in the Lamport paper, except that the serial carries a value which can be interpreted as time, but that is just useful for display purposes. And as such, Lamport clock equality is, as you say, meaningful only within a single generator's ID space.
Each generator's ID's define an ordering on its subset of observed events, so that gives partial orderings. Yes, linking these into a total order becomes messy: it is impossible in the general case.
What I was trying to say in response to that, with the reference to TSL, and what Lamport paper also says: under certain conditions it is possible to correlate some partial orderings, and that is possible if two events of different generators are causally related. That is for instance the case with an Ada rendezvous, or 'wait for message reception and act' in the Hewitt Actor model. In such cases you can link two such orderings into a temporal chain of events, regardless of what time-based local clocks might say (the pure Lamport clock doesnt say anything at all about absolute time, it's just a sequence number and hence relative).
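The causal-link mechanism described here is the standard Lamport clock update rule, which can be sketched briefly (the Process class is illustrative):

```python
# The classic Lamport rule: each generator keeps a local counter; a send
# tags the message with the counter, and a receive merges via
# max(local, received) + 1. The receive is exactly the causal interaction
# that links two generators' partial orderings into one chain.
class Process:
    def __init__(self):
        self.clock = 0

    def event(self):                 # any local event ticks the clock
        self.clock += 1
        return self.clock

    def send(self):                  # the timestamp travels with the message
        return self.event()

    def recv(self, msg_clock):       # merge: receive happens-after the send
        self.clock = max(self.clock, msg_clock) + 1
        return self.clock
```

After a message passes from p to q, every later event at q carries a larger clock than the send event at p, so the send is ordered before the receive regardless of what wall clocks say.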
There is no extra complexity involved - we are still talking about a 64-bit integer after all. But I think it is helpful to clearly spell out what that integer actually means and which operations can be applied to it. Just recapitulating established CS theory, a tad late though ;)
On 4/19/2014 5:18 PM, Michael Haberler wrote:
under certain conditions it is possible to correlate such partial orderings, and that is possible if two events of different generators are causally related.
That all sounds good. I'm pretty sure I finally understand what you were getting at.
I do think 64 bits is a bit much for an ID, but I don't really see how it's much worse than 32 bits for all but the slowest links (where something else can likely be implemented, since there are unlikely to be thousands of generators competing at ns resolution on a 9600 baud serial link, for instance), and I think dropping to 16 bits is unnecessarily limiting.
Also, I would write the code expecting the "timestamp" values to roll-over, regardless of how ridiculously long the rollover period happens to be. Murphy says that it will happen! :)
Charles Steinkuehler [email protected]
yeah, I thought about this too.
The point where most processing comes in is the originator: generating an ID, tacking it onto a command message, then waiting for that ID to appear in a response message. Typically that would be commands originating somewhere in userland and sent to RT; RT processes the command and eventually sends a reply. In that case the ID is a pass-through value; it's not inspected or modified. So it would be mostly userland which has to work with ID contents. The cardinality of this is roughly the number of canon commands, and that is normally just a bit more than the number of Gcode lines.
The reverse case is possible theoretically and from an API point of view: a publisher sitting right in RT and producing updates. Assuming updates are used for tracking in UI's and elsewhere, the rate is likely to be lower than 20Hz (HAL tracking is now done by haltalk - a userland proxy thread scanning remote component and group scalars for change and publishing on change - so again not RT critical).
Rollover: that could be triggered by changing the default parameters of the ID generator algorithm, which I want to make config items; like setting a short mask/number of bits for the time field.
I would love to regression-test code which includes inter-component messaging. It wouldn't hurt to overhaul the basis for regression testing in LinuxCNC, but I'm not aware of a framework which is good at distributed setups.
beginnings of requirements and cleanup work needed:
The 'machineflake' will consist of a local timestamp, a serial, and a generator ID. While boundaries may be configurable, the generator ID needs to run from 0..max where max fits within the bits allocated for the generator ID.
For minimum confusion, it is best to reuse the rtapi module ID (which doubles as HAL component ID). However, if this ID is used, it needs to be contiguous within the 0..max range, and this is currently not the case for rt-preempt/posix (RT comp ID's start at 32768+ to distinguish them from userland modules which start at 1+).
So there's some minor unification work needed in RTAPI to prepare for such an ID space:
- there will be a getGeneratorId() operation which is available in RT as well as remotely over an RPC, so ID's are unique across an instance
- the range of ID's will be: for RT modules, fixed in the 1..RTAPI_MAX_MODULES range (since this id is used as an index into an array of module descriptors); for userland modules and non-RT generator ID's, the id will run from RTAPI_MAX_MODULES+1..max (where max is 2^(generator ID bits) - 1).
As a positive side effect, a flake is tagged as RT- or non-RT originated.
decision item: reuse module ID's, or not? non-reuse might imply ID space exhaustion at some point. Probably best to employ an LRU scheme for reallocation.
preparatory step: unify RTAPI module ID's to be contiguous for RT, and make non-RT ID's unique and outside the RT ID range.
I've spent too much time on this, so for now I'm falling back to a simple serial/generatorID scheme (32 bits each) so the concept can be worked on, but the serial cannot be interpreted as a timestamp for now.
One of the issues was: find hi-res timing sources which are consistent across userland and RT, for all thread flavors, and which can be interpreted as wall clock time; otherwise one gets into correlating and adjusting for the difference between two different timer sources. Also, for reducing the number of bits required, a custom epoch is needed, so this gets messy very quickly. The mailing lists helped; the summary so far is:
- for Xenomai: RT uses rt_timer_read() for ns timestamps; that timer can be accessed from userland using the Posix skin via clock_gettime(CLOCK_HOST_REALTIME)
- for RT-PREEMPT, RT and userland use clock_gettime(CLOCK_MONOTONIC) for now. However this returns time since boot, not wall time, and it turns out to be subject to NTP adjtime() drift. For timestamps, CLOCK_REALTIME, CLOCK_REALTIME_COARSE, or CLOCK_MONOTONIC_RAW look like better options.
- for RTAI, right now rt_get_cpu_time_ns() is used in the kernel, which looks like this:

```c
RTIME rt_get_cpu_time_ns(void)
{
    // llimd is a fast 64/32 multiply/divide operation
    return llimd(rdtsc(), 1000000000, tuned.cpu_freq);
}
```
This code can be replicated on userland/RT x86 hosts but needs an offset calibration step for wall clock time.
Summary: likely a new RTAPI primitive (rtapi_get_timestamp() or so) is needed, as well as flavor-dependent userland code (ULAPI). Timestamps generated on non-RT hosts probably should use CLOCK_REALTIME.
Related: http://man7.org/linux/man-pages/man2/clock_gettime.2.html
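For illustration, the userland side of the clock-source distinction can be observed from Python on Linux (assuming Python >= 3.7 for the _ns variants); this is only a sketch, not the proposed rtapi_get_timestamp():

```python
# Userland illustration of the clock sources discussed above (Linux only):
# CLOCK_MONOTONIC counts from an arbitrary point (e.g. boot) and never
# steps; CLOCK_REALTIME is wall time and may be stepped by NTP.
import time

def monotonic_ns():
    return time.clock_gettime_ns(time.CLOCK_MONOTONIC)

def wall_ns():
    return time.clock_gettime_ns(time.CLOCK_REALTIME)
```

The offset between the two is exactly the "offset calibration step" mentioned above for converting a monotonic source into wall-clock timestamps.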
On 4/28/2014 11:01 PM, Michael Haberler wrote:
I've spent too much time on this, so for now I'm standing back for a simple serial/generatorID scheme (32bits each) so the concept can be worked, but the serial cannot be interpreted as a timestamp for now.
I'm pretty sure I said somewhere above that timestamps were optional. :)
I also recommend making it VERY easy to adjust the 32-bit size of your serial number. We will want to be able to drop that to something like 8 bits or less to test for problems with roll-over. Honestly, I'd initially develop using a small serial number field to avoid any roll-over issues at the outset, and increase the size when you get to the point where you are rapidly generating messages and need more ID space.
Charles Steinkuehler [email protected]
Makes sense, but at a minimum the serial space must exceed the length of the largest queue to be handled, or you'll have a wraparound event before the queue is filled, meaning you can't track entries properly anymore.
e.g. the tp queue has DEFAULT_TC_QUEUE_SIZE 2000
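One way to make the constraint concrete is wraparound-safe serial comparison in the style of RFC 1982 serial-number arithmetic; SERIAL_BITS here is deliberately small for illustration:

```python
# Sketch of the constraint above with a wraparound-safe comparison
# (serial-number arithmetic as in RFC 1982): ordering is well-defined only
# while outstanding entries span less than half the serial space, so the
# space must comfortably exceed the queue depth (e.g. DEFAULT_TC_QUEUE_SIZE).
SERIAL_BITS = 16                # deliberately small, for illustration
HALF = 1 << (SERIAL_BITS - 1)
MASK = (1 << SERIAL_BITS) - 1

def serial_lt(a, b):
    """True if serial a precedes serial b, modulo wraparound."""
    return ((b - a) & MASK) != 0 and ((b - a) & MASK) < HALF
```

With 16-bit serials, up to 32767 entries can be in flight before ordering becomes ambiguous - comfortably above a 2000-entry tp queue, while 8-bit serials would not be.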
This is an interesting & practical paper on timestamps in distributed systems.
A good solution, but likely overkill: 'How to Have your Causality and Wall Clocks, Too' by Jon Moore.
That's pretty similar to how the time synchronization of FlexRay works.
Yesterday's talk on task showed significant interest in fixing the serial number collision problem; now that the talk has established some common understanding of how task and UI's interact, we might refocus on the problem - it is in fact more involved than just avoiding serial collisions.
this thread diverted a bit from the original problem to the mechanics of defining serials, so let me recap what the issues actually are:

1. serial numbers may collide as they are created by task clients without coordination.
2. serial numbers in the current code base need to be unique but not monotonically increasing.
3. the actual problem cannot be fixed by a guaranteed-to-be-unique serial alone.
4. it probably cannot be fixed as long as we have NML/RCS as middleware.
re (1): to understand the issue, look at the code in emcmodule.cc which injects commands into task:

- a serial number is created per-client in a non-unique way here
- each new command message is tagged with this serial (look for all calls to next_serial)
- the command message is written to the command channel
- the emcWaitCommandReceived function polls the status channel for an update of EMC_STAT by task, which would contain the echo_serial_number filled in by task to acknowledge reception of the command
- the echo_serial_number is tested for equality - if equal, the client assumes the command has been received (btw without any error checking - emcWaitCommandReceived is a void function)
- the same flow is applied to test for an injected command being complete - note the comparison of status against RCS_DONE and RCS_ERROR, both of which signal completion - RCS_EXEC would signify a command still in progress
re (2): I mentioned I was not sure about serial numbers being assumed to be monotonically increasing - this is not the case. To see why, one can grep for all uses of echo_serial_number in the code base and check which comparison operators are applied. This command does that: grep -r echo_serial_number src/emc/ | egrep -e '(>|<)' - the hits contain no use of the < or > operators. However, the patch by @sittner referenced in LinuxCNC bug 328 would change that so that monotonically increasing serials are assumed. A variant thereof has found its way into the LinuxCNC code base, which partially fixes the problem.
re (3): unique serials are not sufficient. A second problem is the semantics of the NML status channel - it does not provide queued communication but rather has 'last update counts' semantics. @sittner describes the issue here: if several clients are injecting commands, the EMC_STAT structure as provided by the status channel only shows the last update; a previous update which - for whatever reason, like a random delay - has not been read in time by the originating client is lost, as it is overwritten by a faster second client. So this is a race condition between several clients updating their view of EMC_STAT and task updating the same structure.
To summarize, the "global shared buffer update" as provided by the NML status channel does not provide the semantics necessary to funnel individual acknowledgements to their respective originators.
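A toy model of the lost-update semantics, purely for illustration:

```python
# Toy model of the NML status-channel problem described above: one shared
# echo_serial_number slot with last-update-counts semantics. When task acks
# two commands before the first client polls, the first ack is overwritten
# and client A's command appears to never have been received.
class StatusChannel:
    def __init__(self):
        self.echo_serial_number = None   # single shared slot, no queue

    def ack(self, serial):               # task acknowledging a command
        self.echo_serial_number = serial # overwrites any unread ack

status = StatusChannel()
status.ack(41)      # ack for client A's command
status.ack(42)      # ack for client B's command lands before A polls
lost = status.echo_serial_number != 41   # A's acknowledgement is gone
```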
What would be needed is an RPC-style per-client bidirectional communications pipe with the following properties:

1. a command injected is tagged with a unique originator id.
2. task must be able to monitor an arbitrary number of such per-client pipes.
3. responses to a request coming in on a particular pipe must be directed only to the originating client, based on the originator id from (1).
4. these pipes must support queued semantics - no update may get lost either way.

(1) assures task cannot get confused about where a command came from. (2) assures several UI's may work in parallel. (3) and (4) fix the lost-update problem.
I note that NML does not support such an RPC-style pattern. I have tried in the past, and it is essentially not doable for lack of queued operations, private pipes between the communicating entities, and an originator ID mechanism to identify the originator and route the replies. So frankly this might not be properly doable at all as long as we use NML.
Note that zeroMQ has everything that is needed - a DEALER socket in a client and a ROUTER socket in task provide the required semantics. zeroMQ also takes care of unique client ID's, so the tuple (client ID, serial) would still be unique. That said, it would probably still be desirable to have globally unique ID's, because they aid in tracking messages as they flow through the system, and they are more HAL-friendly than a tuple.
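For illustration only (without using zeroMQ itself), the ROUTER-side semantics can be sketched with per-client queues; the Router class and method names are hypothetical:

```python
# Stdlib sketch of the ROUTER-side semantics described above: each client
# has a private reply queue, requests arrive tagged with the client
# identity, and replies are routed back only to the originating client's
# queue - queued, so no ack can be overwritten or misdelivered.
from collections import deque

class Router:
    def __init__(self):
        self.pipes = {}                  # identity -> private reply queue

    def connect(self, identity):
        self.pipes[identity] = deque()

    def handle(self, identity, serial, command):
        result = ("done", serial, command)   # task would execute here
        self.pipes[identity].append(result)  # reply only to the originator
```

Note that two clients may even use the same serial: the (identity, serial) tuple keeps replies unambiguous, which is the property NML's shared status buffer lacks.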
So, if one were to go about fixing this design flaw once and for all:

- duplicate serials are actually not the core issue.
- trying to fix this flaw within the existing NML framework is a dead end.
- using zeroMQ in the command path will fix the issue for good. One option is to migrate selectively: introduce zeroMQ into task, but have it handle only command input for now, retaining the NML message format while dropping the unsuitable RCS communications path.
On a migration strategy:
I actually did some work addressing the last item but that was never merged. However, it worked as advertised, and mixing zeroMQ transport and NML message contents was not an issue.
The problem is rather isolated - really just the command input. What we need is some impact analysis: identify all code which injects commands into task, and see if it is all needed.
```
$ grep -rl echo_serial_number emc/
emc/usr_intf/emcsh.cc
emc/usr_intf/emclcd.cc
emc/usr_intf/emccontroller.cc
emc/usr_intf/shcom.cc
emc/usr_intf/keystick.cc
emc/usr_intf/emcrsh.cc
emc/usr_intf/axis/extensions/emcmodule.cc
emc/usr_intf/halui.cc
emc/usr_intf/xemc.cc
emc/usr_intf/schedrmt.cc
emc/iotask/ioControl.cc
emc/iotask/ioControl_v2.cc
emc/task/emctaskmain.cc
emc/task/taskintf.cc
emc/task/taskclass.cc
emc/task/iotaskintf.cc
```
Out of these, emcsh, emclcd, keystick, emcrsh, emcmodule, halui, xemc and schedrmt are bona-fide UI's, so those would have to migrate in concert; some, like schedrmt, might have to go anyway.
thinking of it - there might be a way to do this piecemeal:
- leave the current input channel in place so the above UI's keep working
- add a parallel input channel based on zeroMQ/NML which is just an alternative way of talking to task; the old UI's would not be affected
- then migrate emcmodule.cc (the 'import linuxcnc' extension) - this would use the new channel exclusively
- this should actually fix the issue for all code which uses 'import linuxcnc' - which is most of the usage anyway
- once that code path is known to be stable, the other UI's could be migrated at leisure, taking clues from emcmodule
see also the list post
I actually did some work addressing the last item but that was never merged. However, it worked as advertised, and mixing zeroMQ transport and NML message contents was not an issue.
Can you point to this as a starter, in your repo or wherever?
these are the branches (for-sascha, and the underlying zmq-submit-task.commands): http://git.mah.priv.at/gitweb?p=emc2-dev.git;a=shortlog;h=refs/heads/for-sascha
this was about using the 'cheesestand ticket' analogy I referred to, so that is just one option
when reading this branch, I suggest concentrating on the diffs of src/emc/task/emctaskmain.cc ("server" side) and src/emc/usr_intf/axis/extensions/emcmodule.cc ("client" side)
I did some writeup on migration back then, but that was about taking too big a bite at one step (including protobuf and whatnot) - take with a big lump of salt
Also since @einstine909 is now on the canon case, we will have several options downstream
But IMO just adding a zeroMQ path in parallel to the NML input channel looks perfectly doable and pretty well isolated
also see the standalone command injection demo
not sure all this still works - ping me before doing anything more than reading, and I'll rebase onto mk first if you want
I suspected you had already 'invented the wheel', just needed a tyre change :smile: Won't be able to look properly for a few days.
emcmodule.cc doesn't do anything for me, but I thought of implementing the server in emctask as a parallel command insertion route and modifying a copy of my C++ access libraries to pump their commands by that route.
If that worked, someone would just need to convert perfectly good C++ code into parsel-tongue bindings, to get the python equivalent :snake:
oh, let me have a look in the evening, I'll clean up and rebase, and review what I actually did (and what actually worked)
it could be just that the standalone injection demo was what worked, and I had not touched emcmodule yet at all
on the parsel-tongue thingie replacing 'import linuxcnc': almost. zeroMQ would be just fine as handled by 'import zmq'. The bummer is: as long as we use NML we need to remain in C++ land. As soon as the message content is encoded in protobuf it's all Python (and downhill from there ;)
@mhaberler Saw your talk yesterday, thanks for all the information!
Avoiding the duplicate serials globally is a good idea. Regarding using ZeroMQ, does that mean milltask is being changed to have something like http://zguide.zeromq.org/page:all#Shared-Queue-DEALER-and-ROUTER-sockets (Figure 16)?
@mhaberler
Might it be possible to write a wrapper around zeromq to make it have the NML interface and just drop it in everywhere? Then migrate code to use zeromq directly as appropriate (where additional functionality is needed)?
Ken