machinekit
Design error: command serials may collide
Commands passed via NML typically carry an integer serial number for tracking status (received, being acted upon, completed, errored).
There is a fundamental design error: serial numbers are assigned on the originating side of a command. As soon as more than one command originator is active, this creates the danger of collisions, because NML has no originator identifier that could be used to make a submission ID unique (via a (submitter, serial) tuple).
An example for a hack around this issue is here: https://github.com/machinekit/machinekit/blob/master/src/emc/usr_intf/emcrsh.cc#L2842
Problem: specify and implement a serial scheme that guarantees uniqueness.
This is a real issue if you want to use HALUI together with a normal UI like AXIS (please refer to my original bug report to LinuxCNC: http://sourceforge.net/p/emc/bugs/328/)
The patch I submitted there is more of an intermediate hot fix to ensure atomic submission ID's. A real solution to this issue will require either a publish/subscribe pattern for result notification (if an asynchronous design is required) or blocking calls for command submission (in a synchronous design). In my opinion the first is more flexible and fits better into the current design. Michael has already done a lot of work in this direction with his ZMQ/Protobuf migration.
I see several ways to deal with the issue; I'm not totally happy with any of them:
(1) Any submitter first acquires a unique ID from an ID-issuing service. Guarantees global uniqueness and total ordering of ID's, with no holes between ID's, but there's significant overhead associated with each acquisition.
(2) There's a single entity accepting commands and that entity issues ID's in response to the request. That guarantees ordering and no holes between ID's. De facto the global serial is a local variable of that entity. Unfortunately the single command acceptor is a significant restriction. On second thought, I exclude that option.
(3) There's a service which hands out serial number ranges, e.g. in blocks of, say, 1000. Any command submitter acquires a block of ID's first, and acquires a new one whenever that block is depleted. Upside: total ordering, uniqueness, efficient. Downside: ID ranges might have holes (i.e. not be strictly sequential). Bonus upside: the block-issuing service can identify the source of a given ID if it keeps a {socket name, ID range} table.
I'm leaning towards (3); the natural place for ID management is the atomic ID-allocating code which already exists in rtapi_msgd (for HAL module ID's etc). That range would be unique per RT instance.
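To make option (3) concrete, here is a minimal sketch of block-wise serial allocation. All names (BlockIssuer, Submitter, acquire_block) and the block size are illustrative only, not actual rtapi_msgd API:

```python
# Sketch of option (3): a central service hands out serial-number blocks,
# and each submitter draws IDs from its current block, fetching a new one
# on depletion. Names and the block size are hypothetical.
import itertools
import threading

BLOCK_SIZE = 1000  # example block size from the discussion

class BlockIssuer:
    """Stands in for the atomic ID allocator (e.g. in rtapi_msgd)."""
    def __init__(self):
        self._next = itertools.count(0)
        self._lock = threading.Lock()
        self.owners = {}  # {block_start: submitter_name}, for source lookup

    def acquire_block(self, submitter):
        with self._lock:
            start = next(self._next) * BLOCK_SIZE
            self.owners[start] = submitter
            return start, start + BLOCK_SIZE

class Submitter:
    def __init__(self, issuer, name):
        self._issuer, self.name = issuer, name
        self._cur, self._end = issuer.acquire_block(name)

    def next_serial(self):
        if self._cur >= self._end:          # block depleted: fetch another
            self._cur, self._end = self._issuer.acquire_block(self.name)
        serial, self._cur = self._cur, self._cur + 1
        return serial
```

Serials stay globally unique and totally ordered, while a submitter only talks to the issuer once per 1000 IDs; the owners table gives the "identify the source of a given ID" bonus.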
The only questions warranting more consideration are:
is there any downside to having 'holes' between ID's? I don't think there is; the 'equals' and 'greater than' operators are required, but those are not affected by holes.
Restriction on ID's: they must fit within a HAL data type, i.e. for now 32bit ints. It should be straightforward to do 64bit scalars, but a UUID is out of scope (128 bits required).
Question: are 'Lamport timestamps' OK as ID's? http://en.wikipedia.org/wiki/Lamport_timestamps (in other words: must the ID's support determining the order of events per originator, but not globally)
Question: sortability of ID's: this certainly must hold for ID's of a single originator, but I do not see a requirement for sortability at the global level
the seminal Lamport paper: http://www.stanford.edu/class/cs240/readings/lamport.pdf
Here's an interesting presentation on the issue: http://de.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems , covering the theme from various angles
Also http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation discusses some options.
The 'Twitter snowflake' sounds interesting as a basis:

- Snowflake IDs are 64 bits, half the size of a UUID
- Can use time as first component and remain sortable
- Distributed system that can survive nodes dying
The Instagram ID's would probably work, also requiring 64 bits: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram Each ID consists of:

- 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
- 13 bits that represent the logical shard ID
- 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
Another option is simpleflake: http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/
A lamport clock, but UUID-size (128bits): https://github.com/jamesgolick/lexical_uuid.erl
On 4/17/2014 10:59 AM, Michael Haberler wrote:
The 'Twitter snowflake' sounds interesting as a basis; I would like to avoid time-synchronisation as requirement though.
Time synchronization is not required, unless you need to be able to (trivially) sort the ID's by generation time. Even without synchronized clocks, you can align the ID's if debugging by using the generator ID and a time offset for each generator.
The basic idea seems to be that each ID is composed of:

- A guaranteed unique generator ID (which could perhaps be assigned as part of the subscribe/publish protocol?)
- Enough bits created and assigned locally by the generator to ensure no two local IDs are ever the same. Optionally, this value can include a timestamp and be constructed such that it is easy to order the events by creation time.
Charles Steinkuehler [email protected]
yes, that's the idea. And time synchronization is a non-requirement, my bad. And global time is not a useful concept for event ordering in such a system anyway.
The generator ID will need to include the HAL module ID's because those are originators/consumers too, not just remote entities like UI's. So something like a remote getNextModuleId() service covers that, and ID's would be unambiguously tagged as originating from a certain module (which might even help debugging).
As for the type of system (see http://en.wikipedia.org/wiki/Lamport_timestamps#Lamport.27s_logical_clock_in_distributed_systems) I think what we have at hand is a system with partial order between processes, suggesting this concept is the right fit
Actually I see ID's based on Lamport clocks as having upside: right now serials are int's which are happily compared with no safeguards as to origin; but comparing timestamps from different originators makes no sense by definition. But defining happened-before and equality operators on such a datatype would be able to raise an exception if timestamps of different origin are compared, which is a conceptual error. And we get a timestamp for free.
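A minimal sketch of such an origin-checked ID type; the names (Stamp, origin, clock) are illustrative only, not a proposed API:

```python
# Sketch of the ID type proposed above: ordering comparisons raise when
# stamps from different originators are compared (a conceptual error),
# while equal-origin stamps order by their local Lamport counter.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stamp:
    origin: int      # generator/module ID
    clock: int       # local Lamport counter (interpretable as a timestamp)

    def _check(self, other):
        if self.origin != other.origin:
            raise ValueError("cannot order stamps from different originators")

    def __lt__(self, other):   # happened-before within one originator
        self._check(other)
        return self.clock < other.clock
```

Equality still works across origins (stamps with different origins simply compare unequal); only the ordering operators enforce a common originator.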
I guess the next step is to do some back-of-envelope calculations to derive sensible boundaries for timestamp granularity, serial and module ID which cover the execution model for the foreseeable future. I'll make it a macro and #defines just in case.
let's start with a modified Instagram scheme. Assume a 64-bit total ID size.
I'll swap the sizes of the sequence number and 'shard ID' fields since we have different size and accuracy requirements. That leaves 10 bits, or a maximum of 1024, for the module ID.
Each of our IDs consists of:
- 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
- 13 bits that represent an auto-incrementing sequence, modulus 8192.
- 10 bits that represent the generator ID, maxing out at 1024 generators.
This means we can generate 8192 IDs per module per millisecond - one every ~122ns, or a rate of about 8 million messages/second. This would be good enough for most scenarios except very tight loops (even ringbuffer operations take 100-300ns/message on my Mac), and that scenario is unlikely.
A solution could be: a generator is required to check uniqueness of an ID (this would only matter if it is called less than ~122ns apart, so it is unlikely and adds very little delay; it could even be done as a busy-loop in an RT thread).
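The packing just described is plain bit arithmetic; the helper names below are illustrative:

```python
# Bit-packing sketch of the modified scheme above: 41 bits of millisecond
# timestamp, 13 bits of sequence, 10 bits of generator ID in one 64-bit int.
# Field widths follow the discussion; pack_id/unpack_id are hypothetical names.
TIME_BITS, SEQ_BITS, GEN_BITS = 41, 13, 10

def pack_id(ms, seq, gen):
    assert ms < (1 << TIME_BITS) and seq < (1 << SEQ_BITS) and gen < (1 << GEN_BITS)
    return (ms << (SEQ_BITS + GEN_BITS)) | (seq << GEN_BITS) | gen

def unpack_id(flake):
    gen = flake & ((1 << GEN_BITS) - 1)
    seq = (flake >> GEN_BITS) & ((1 << SEQ_BITS) - 1)
    ms = flake >> (SEQ_BITS + GEN_BITS)
    return ms, seq, gen
```

Because the timestamp occupies the high bits, integer comparison of two flakes from the same generator orders them by creation time.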
I think it's safe to assume that an ensemble of more than 1024 components (HAL or otherwise) talking to each other is unlikely to happen.
This timestamp is absolute for the next 41 years. Note the references to Twitter, Instagram etc cover a requirement which we don't have: those ID's are for permanent storage in databases, unambiguously tagging an item with an absolute time reference. We don't have that requirement because ID's need not be persistent across sessions.
There's one variant which might warrant consideration: use timestamps relative to startup. If we assume a maximum instance lifetime of say 1 year, that would gain us 5 bits off the timestamp.
If we put these 5 bits to work in the serial (so the cut would be 36 bits of timestamp relative to startup, 18 bits of sequence, 10 bits of generator ID), that would mean we can generate 262 million ID's per second, or roughly a unique ID every 4ns.
Cost/benefit versus the first variant as I see it:
- decoding this timestamp into absolute time requires the startup epoch (but since generator ID needs to be managed per-instance there's no extra cost)
- uniqueness even in tight loops pretty much guaranteed (but not much of an upside)
- the assumed maximum instance lifetime could turn out to be a bummer if this is used for long-running processes like automation.
In fact it might not be necessary to make a hard decision on this anyway. Since we need to centrally manage the generatorID (which translates into a RPC at startup time to retrieve the next unique generator ID) we could just tack on the startup epoch, masks and shift counts as variables to this reply, making the whole decision a configuration time issue.
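The "make it a configuration time issue" idea might look like this; the FlakeCodec name and the notion of deriving shifts/masks from field widths carried in the RPC reply are assumptions for illustration:

```python
# Sketch of deferring the layout decision: the getGeneratorId() reply could
# carry the field widths (and startup epoch), and both sides derive their
# shift/mask constants from them. The class and both layouts are illustrative.
class FlakeCodec:
    def __init__(self, time_bits, seq_bits, gen_bits):
        assert time_bits + seq_bits + gen_bits == 64
        self.seq_shift = gen_bits
        self.time_shift = gen_bits + seq_bits
        self.gen_mask = (1 << gen_bits) - 1
        self.seq_mask = (1 << seq_bits) - 1

    def pack(self, t, seq, gen):
        return (t << self.time_shift) | (seq << self.seq_shift) | gen

    def unpack(self, flake):
        return (flake >> self.time_shift,
                (flake >> self.seq_shift) & self.seq_mask,
                flake & self.gen_mask)

# the two variants discussed above, selectable at configuration time:
absolute = FlakeCodec(41, 13, 10)   # absolute ms epoch
relative = FlakeCodec(36, 18, 10)   # relative-to-startup variant
```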
On 4/18/2014 3:18 AM, Michael Haberler wrote:
There's one variant which might warrant consideration: use timestamps relative to startup. If we assume a maximum instance lifetime of say 1 year, that would gain us 5 bits off the timestamp.
If we put these 5 bits to work in the serial (so the cut would be 36 bits of timestamp relative to startup, 18 bits of sequence, 10 bits of generator ID), that would mean we can generate 262 million ID's per second, or roughly a unique ID every 4ns.
IMHO, all of this is likely massive overkill. The systems I work with get 5-8 bits of unique transaction identifier to use. That is combined with a unique originator ID, but the result is less than 32-bits.
How long is it necessary for a transaction ID to exist? Typically, in high-speed systems like you're talking about (generating IDs on a nanosecond time-scale), the ID only needs to survive for as long as the particular transaction it's identifying.
PCI Express, for example, uses a 16-bit "generator ID" (8-bit bus ID + 5-bit device number + 3-bit function), combined with up to 8 bits of "tag" assigned by the generator. That still leaves 8 bits free in a 32-bit word... exactly how "unique" do these tags need to be?
Charles Steinkuehler [email protected]
just exploring the boundary cases, which includes overkill.
IMO it's not completely out of scope to consider a long-running application like some automation project with an uptime of years. I wouldn't want assumptions like maximum runtime == 1 year before ID rollover compiled in.
But that's not going to be a problem anyway if we go the route I laid out: generator params are distributed with the initial getGeneratorId() RPC, and those work from compiled-in defaults which could potentially be overridden by config options.
the upside of all this: we get strong type checking on serial compares (only ID's with the same generatorID are legitimate to compare), and timestamps for timing measurements for free (haha... need to write this code first though).
On 4/19/2014 4:04 AM, Michael Haberler wrote:
IMO it's not completely out of scope to consider a long-running application like some automation project with an uptime of years. I wouldn't want assumptions like maximum runtime == 1 year before ID rollover compiled in.
The question is do ID values have to be unique for the life of the application or just for the life of the request? IMHO regardless of how big your ID is, you will eventually exhaust the space and eventually create overlapping ID values after some (possibly quite vast) amount of time.
I see no reason ID values HAVE to be unique for longer than the existence of the transaction, which is presumably quite short. And once you can re-use ID values safely, the required number of bits shrinks dramatically.
...but I could be missing something, and it can be handy to have nice wall-clock timestamps in the IDs to make for easier debugging. I'm just trying to understand if it's really required.
I would also say it is bad design to craft a system where rollover and ID collision cause any sort of problem at all, so why not make the rollover period fairly small (say somewhere between a few seconds and an hour)?
Charles Steinkuehler [email protected]
Right. Need to exclude any chances of problems.
To distinguish one ID from another, the equality operator is sufficient. But equality alone doesn't give you the ordering of ID's of a given originator.
Assume for a moment you tap into several components and log ID's from a bird's-eye view. We know that global time doesn't make sense in a distributed system (clock skew and transmission delay lead to undecidability). So all you get is a bag of ID's. But if you don't have a happened-before operator on the ID's, you can't get at the temporal, and hence causal, ordering of events.
I think that is a fairly important property. It's laid out here: http://en.wikipedia.org/wiki/Happened-before
On 4/19/2014 6:55 AM, Michael Haberler wrote:
To distinguish one ID from another, the equality operator is sufficient. But equality alone doesn't give you the ordering of ID's of a given originator.
So the question becomes which IDs need to be compared? If it's just the active IDs for a single endpoint, the order they were queued in implicitly defines happened-before.
If we need to compare IDs in a larger time horizon that includes no-longer active IDs, the questions becomes how long a horizon and how is rollover handled.
If we need to compare IDs between multiple end-points, this gets messy real fast.
Charles Steinkuehler [email protected]
right, and time sortability gives us the queuing order of a single originator (endpoints might, and do receive commands from more than one originator; eg. motion with jog-while-paused: task/ui/interp is one originator, the jog-while-paused HAL driving entity another).
making the time axis a window which can be shifted by a 'reset' type operation (forget all ID's before the window start): that is an option. I think resetting time is potentially non-trivial: I guess one would need to broadcast to all possible generators, make sure they flush all queues, and then collect consent from all generators before proceeding.
re ID's from different endpoints: actually that was part of my post-PhD dabblings: http://goo.gl/in4tJP and http://goo.gl/6UsXNe (executive summary); we built a language (TSL - Task Sequencing Language) to express such relations in a kind of event algebra, basically as temporal assertions. The key to linking different generators is an interaction between the generators which asserts sequencing/causality (in the TSL case an Ada rendezvous). We even had partial compile-time detection for non-causal event sequences. You could tag statements with pseudo-comments and formulate an assertion like so:
begin A => B => C end
meaning to the observer (TSL runtime), events must appear in that order or the assertion was violated; the compiler checked that A, B and C either were from the same originator, or the originators were sequenced through a rendezvous (fuzzy, it's been a while). I promise not to make TSL part of machinekit, though (what a grandiose opportunity for self-citation :)
On 4/19/2014 10:08 AM, Michael Haberler wrote:
right, and time sortability gives us the queuing order of a single originator (endpoints might, and do receive commands from more than one originator; eg. motion with jog-while-paused: task/ui/interp is one originator, the jog-while-paused HAL driving entity another).
You're getting back to your synchronized clocks again. I thought it was decided timestamp comparisons were valid only for IDs from the same generator. If comparison across generators is required, either the problem gets much messier, or the consumer needs to timestamp the IDs when they are received.
Again, this gets back to the question of whether or not IDs need to be compared across generators and/or across consumers, and exactly what the IDs are being used for. If it's just to track messages, I don't see the need for all the complexity or ability to sort messages from the birth of the universe to entropy death with nS resolution. :)
Charles Steinkuehler [email protected]
synchronized clocks: that must be a misunderstanding. A global (=synchronized) clock is a useless concept in distributed systems with asynchronous message passing, so I'm not getting back to them - time equality in such a setup is undecidable (there is lots of research into decidability of this, and this is what Lamport says which is why he advocates local clocks). It may be a useful concept in isochronous systems like the old POTS network, and electronics, but not here.
so no, those are not synchronized - they are strictly local to a generator. It is exactly like in the Lamport paper, except that the serial carries a value which can be interpreted as time, but that is just useful for display purposes. And as such, Lamport clock equality is, as you say, meaningful only within a single generator's ID space.
Each generator's ID's define an ordering on its subset of observed events, so that gives partial orderings. Yes, linking these into a total order becomes messy: it is impossible in the general case.
What I was trying to say in response to that, with the reference to TSL, and what Lamport paper also says: under certain conditions it is possible to correlate some partial orderings, and that is possible if two events of different generators are causally related. That is for instance the case with an Ada rendezvous, or 'wait for message reception and act' in the Hewitt Actor model. In such cases you can link two such orderings into a temporal chain of events, regardless of what time-based local clocks might say (the pure Lamport clock doesnt say anything at all about absolute time, it's just a sequence number and hence relative).
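The causal-link mechanism described here is the standard Lamport clock update rule, which can be sketched briefly (the Process class is illustrative):

```python
# The classic Lamport rule: each generator keeps a local counter; a send
# tags the message with the counter, and a receive merges via
# max(local, received) + 1. The receive is exactly the causal interaction
# that links two generators' partial orderings into one chain.
class Process:
    def __init__(self):
        self.clock = 0

    def event(self):                 # any local event ticks the clock
        self.clock += 1
        return self.clock

    def send(self):                  # the timestamp travels with the message
        return self.event()

    def recv(self, msg_clock):       # merge: receive happens-after the send
        self.clock = max(self.clock, msg_clock) + 1
        return self.clock
```

After a message passes from p to q, every later event at q carries a larger clock than the send event at p, so the send is ordered before the receive regardless of what wall clocks say.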
There is no extra complexity involved - we are still talking about a 64-bit integer after all. But I think it is helpful to clearly spell out what that integer actually means and which operations can be applied to it. Just recapitulating established CS theory, a tad late though ;)
On 4/19/2014 5:18 PM, Michael Haberler wrote:
under certain conditions it is possible to correlate such partial orderings, and that is possible if two events of different generators are causally related.
That all sounds good. I'm pretty sure I finally understand what you were getting at.
I do think 64 bits is a bit much for an ID, but I don't really see how it's much worse than 32 bits for all but the slowest links (where something else can likely be implemented, since there are unlikely to be thousands of generators competing at ns resolution on a 9600 baud serial link, for instance), and I think dropping to 16 bits is unnecessarily limiting.
Also, I would write the code expecting the "timestamp" values to roll-over, regardless of how ridiculously long the rollover period happens to be. Murphy says that it will happen! :)
Charles Steinkuehler [email protected]
yeah, I thought about this too.
The point where most processing comes in is the originator: generating an ID, tacking it onto a command message, then waiting for that ID to appear in a response message. Typically that would be commands originating somewhere in userland and sent to RT; RT processes the command and eventually sends a reply. In that case the ID is a pass-through value; it's not inspected or modified. So it would be mostly userland which has to work with ID contents. The cardinality of this is roughly the number of canon commands, and that is normally just a bit more than the number of Gcode lines.
The reverse case is possible theoretically and from an API point of view: a publisher sitting right in RT and producing updates. Assuming updates are used for tracking in UI's and elsewhere, the rate is likely to be lower than 20Hz (HAL tracking is now done by haltalk - a userland proxy thread scanning remote component and group scalars for change and publishing on change - so again not RT critical).
Rollover: that could be triggered by changing the default parameters of the ID generator algorithm, which I want to make config items; like setting a short mask/number of bits for the time field.
I would love to regression-test code which includes inter-component messaging. It wouldn't hurt to overhaul the basis for regression testing in LinuxCNC, but I'm not aware of a framework which is good at distributed setups.
beginnings of requirements and cleanup work needed:
The 'machineflake' will consist of a local timestamp, a serial, and a generator ID. While boundaries may be configurable, the generator ID needs to run from 0..max where max fits within the bits allocated for the generator ID.
For minimum confusion, it is best to reuse the rtapi module ID (which doubles as HAL component ID). However, if this ID is used, it needs to be contiguous within the 0..max range, and this is currently not the case for rt-preempt/posix (RT comp ID's start at 32768+ to distinguish them from userland modules which start at 1+).
So there's some minor unification work needed in RTAPI to prepare for such an ID space:
- there will be a getGeneratorId() operation which is available in RT as well as remotely over an RPC, so ID's are unique across an instance
- the range of ID's will be: for RT modules, fixed in the 1..RTAPI_MAX_MODULES range (since this id is used as an index into an array of module descriptors); for userland modules and non-RT generator ID's, the id will run from RTAPI_MAX_MODULES+1..max (where max is 2^(generator ID bits) - 1).
As a positive side effect, a flake is tagged as RT- or non-RT originated.
decision item: reuse module ID's, or not? non-reuse might imply ID space exhaustion at some point. Probably best to employ an LRU scheme for reallocation.
preparatory step: unify RTAPI module ID's to be contiguous for RT, and make non-RT ID's unique and outside the RT ID range.
I've spent too much time on this, so for now I'm falling back to a simple serial/generatorID scheme (32 bits each) so the concept can be worked on, but the serial cannot be interpreted as a timestamp for now.
One of the issues was: find hi-res timing sources which are consistent across userland and RT, for all thread flavors, and which can be interpreted as wall clock time; otherwise one gets into correlating and adjusting for the difference between two different timer sources. Also, for reducing the number of bits required, a custom epoch is needed, so this gets messy very quickly. The mailing lists helped; the summary so far is:
- for Xenomai: RT uses rt_timer_read() for ns timestamps; that timer can be accessed from userland using the Posix skin via clock_gettime(CLOCK_HOST_REALTIME)
- for RT-PREEMPT, RT and userland use clock_gettime(CLOCK_MONOTONIC) for now. However this returns time since boot, not wall time, and it turns out to be subject to NTP adjtime() drift. For timestamps, CLOCK_REALTIME, CLOCK_REALTIME_COARSE, or CLOCK_MONOTONIC_RAW look like better options.
- for RTAI, right now rt_get_cpu_time_ns() is used in the kernel, which looks like this:

```c
RTIME rt_get_cpu_time_ns(void)
{
    // llimd is a fast 64/32 multiply/divide operation
    return llimd(rdtsc(), 1000000000, tuned.cpu_freq);
}
```
This code can be replicated on userland/RT x86 hosts but needs an offset calibration step for wall clock time.
Summary: likely a new RTAPI primitive (rtapi_get_timestamp() or so) is needed, as well as flavor-dependent userland code (ULAPI). Timestamps generated on non-RT hosts probably should use CLOCK_REALTIME.
Related: http://man7.org/linux/man-pages/man2/clock_gettime.2.html
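For illustration, the userland side of the clock-source distinction can be observed from Python on Linux (assuming Python >= 3.7 for the _ns variants); this is only a sketch, not the proposed rtapi_get_timestamp():

```python
# Userland illustration of the clock sources discussed above (Linux only):
# CLOCK_MONOTONIC counts from an arbitrary point (e.g. boot) and never
# steps; CLOCK_REALTIME is wall time and may be stepped by NTP.
import time

def monotonic_ns():
    return time.clock_gettime_ns(time.CLOCK_MONOTONIC)

def wall_ns():
    return time.clock_gettime_ns(time.CLOCK_REALTIME)
```

The offset between the two is exactly the "offset calibration step" mentioned above for converting a monotonic source into wall-clock timestamps.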
On 4/28/2014 11:01 PM, Michael Haberler wrote:
I've spent too much time on this, so for now I'm standing back for a simple serial/generatorID scheme (32bits each) so the concept can be worked, but the serial cannot be interpreted as a timestamp for now.
I'm pretty sure I said somewhere above that timestamps were optional. :)
I also recommend making it VERY easy to adjust the 32-bit size of your serial number. We will want to be able to drop that to something like 8 bits or less to test for problems with roll-over. Honestly, I'd initially develop using a small serial number field to avoid any roll-over issues at the outset, and increase the size when you get to the point where you are rapidly generating messages and need more ID space.
Charles Steinkuehler [email protected]
Makes sense, but at a minimum the serial space must exceed the length of the largest queue to be handled, or you'll have a wraparound event before the queue is filled, meaning you can't track entries properly anymore.
e.g. the tp queue has DEFAULT_TC_QUEUE_SIZE 2000
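One way to make the constraint concrete is wraparound-safe serial comparison in the style of RFC 1982 serial-number arithmetic; SERIAL_BITS here is deliberately small for illustration:

```python
# Sketch of the constraint above with a wraparound-safe comparison
# (serial-number arithmetic as in RFC 1982): ordering is well-defined only
# while outstanding entries span less than half the serial space, so the
# space must comfortably exceed the queue depth (e.g. DEFAULT_TC_QUEUE_SIZE).
SERIAL_BITS = 16                # deliberately small, for illustration
HALF = 1 << (SERIAL_BITS - 1)
MASK = (1 << SERIAL_BITS) - 1

def serial_lt(a, b):
    """True if serial a precedes serial b, modulo wraparound."""
    return ((b - a) & MASK) != 0 and ((b - a) & MASK) < HALF
```

With 16-bit serials, up to 32767 entries can be in flight before ordering becomes ambiguous - comfortably above a 2000-entry tp queue, while 8-bit serials would not be.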
This is an interesting & practical paper on timestamps in distributed systems.
A good solution, but likely overkill: 'How to Have your Causality and Wall Clocks, Too' by Jon Moore.
That's pretty similar to how the time synchronization of FlexRay works.
Yesterday's talk on task showed significant interest in fixing the serial number collision problem; now that the talk has established some common understanding of how task and UI's interact, we might refocus on the problem - it is in fact more involved than just avoiding serial collisions.
this thread diverted a bit from the original problem to the mechanics of defining serials, so let me recap what the issues actually are:

1. serial numbers may collide as they are created by task clients without coordination.
2. serial numbers in the current code base need to be unique but not monotonically increasing.
3. the actual problem cannot be fixed by a guaranteed-to-be-unique serial alone.
4. it probably cannot be fixed as long as we have NML/RCS as middleware.
re (1): to understand the issue, look at the code in emcmodule.cc which injects commands into task:

- a serial number is created per-client in a non-unique way here
- each new command message is tagged with this serial (look for all calls to next_serial)
- the command message is written to the command channel
- the emcWaitCommandReceived function polls the status channel for an update of EMC_STAT by task, which would contain the echo_serial_number filled in by task to acknowledge reception of the command
- the echo_serial_number is tested for equality - if equal, the client assumes the command has been received (btw without any error checking - emcWaitCommandReceived is a void function)
- the same flow is applied to test for an injected command being complete - note the comparison of status against RCS_DONE and RCS_ERROR, both of which signal completion - RCS_EXEC would signify a command still in progress
re (2): I mentioned I was not sure about serial numbers being assumed to be monotonically increasing - this is not the case. To see why, one can grep for all uses of echo_serial_number in the code base and check which comparison operators are applied. This command does that: grep -r echo_serial_number src/emc/ | egrep -e '(>|<)' - the hits contain no use of the < or > operators. However, the patch by @sittner referenced in LinuxCNC bug 328 would change that so that monotonically increasing serials are assumed. A variant thereof has found its way into the LinuxCNC code base, which partially fixes the problem.
re (3): unique serials are not sufficient. A second problem is the semantics of the NML status channel - it does not provide queued communication but rather has 'last update counts' semantics. @sittner describes the issue here: if several clients are injecting commands, the EMC_STAT structure as provided by the status channel only shows the last update; a previous update which - for whatever reason, like a random delay - has not been read in time by the originating client is lost, as it is overwritten by a faster second client. So this is a race condition between several clients updating their view of EMC_STAT and task updating the same structure.
To summarize, the "global shared buffer update" as provided by the NML status channel does not provide the semantics necessary to funnel individual acknowledgements to their respective originators.
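A toy model of the lost-update semantics, purely for illustration:

```python
# Toy model of the NML status-channel problem described above: one shared
# echo_serial_number slot with last-update-counts semantics. When task acks
# two commands before the first client polls, the first ack is overwritten
# and client A's command appears to never have been received.
class StatusChannel:
    def __init__(self):
        self.echo_serial_number = None   # single shared slot, no queue

    def ack(self, serial):               # task acknowledging a command
        self.echo_serial_number = serial # overwrites any unread ack

status = StatusChannel()
status.ack(41)      # ack for client A's command
status.ack(42)      # ack for client B's command lands before A polls
lost = status.echo_serial_number != 41   # A's acknowledgement is gone
```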
What would be needed is an RPC-style per-client bidirectional communications pipe with the following properties:

1. a command injected is tagged with a unique originator id.
2. task must be able to monitor an arbitrary number of such per-client pipes.
3. responses to a request coming in on a particular pipe must be directed only to the originating client, based on the originator id from (1).
4. these pipes must support queued semantics - no update may get lost either way.

(1) assures task cannot get confused about where a command came from. (2) assures several UI's may work in parallel. (3) and (4) fix the lost-update problem.
I note that NML does not support such an RPC-style pattern. I have tried in the past, and it is essentially not doable for lack of queued operations, private pipes between the communicating entities, and an originator ID mechanism to identify the originator and route the replies. So frankly this might not be properly doable at all as long as we use NML.
Note that zeroMQ has everything that is needed - a DEALER socket in a client and a ROUTER socket in task provide the required semantics. zeroMQ also takes care of unique client ID's, so the tuple (client ID, serial) would still be unique. That said, it would probably still be desirable to have globally unique ID's, because they aid in tracking messages as they flow through the system, and they are more HAL-friendly than a tuple.
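For illustration only (without using zeroMQ itself), the ROUTER-side semantics can be sketched with per-client queues; the Router class and method names are hypothetical:

```python
# Stdlib sketch of the ROUTER-side semantics described above: each client
# has a private reply queue, requests arrive tagged with the client
# identity, and replies are routed back only to the originating client's
# queue - queued, so no ack can be overwritten or misdelivered.
from collections import deque

class Router:
    def __init__(self):
        self.pipes = {}                  # identity -> private reply queue

    def connect(self, identity):
        self.pipes[identity] = deque()

    def handle(self, identity, serial, command):
        result = ("done", serial, command)   # task would execute here
        self.pipes[identity].append(result)  # reply only to the originator
```

Note that two clients may even use the same serial: the (identity, serial) tuple keeps replies unambiguous, which is the property NML's shared status buffer lacks.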
So, if one were to go about fixing this design flaw once and for all:

- duplicate serials are actually not the core issue.
- trying to fix this flaw within the existing NML framework is a dead end.
- using zeroMQ in the command path will fix the issue for good. One option is to migrate selectively: introduce zeroMQ into task, but have it handle only command input for now, retaining the NML message format while dropping the unsuitable RCS communications path.
On a migration strategy:
I actually did some work addressing the last item but that was never merged. However, it worked as advertised, and mixing zeroMQ transport and NML message contents was not an issue.
The problem is rather isolated - really just the command input. What we need is some impact analysis: identify all code which injects commands into task, and see if it is all needed.
```
$ grep -rl echo_serial_number emc/
emc/usr_intf/emcsh.cc
emc/usr_intf/emclcd.cc
emc/usr_intf/emccontroller.cc
emc/usr_intf/shcom.cc
emc/usr_intf/keystick.cc
emc/usr_intf/emcrsh.cc
emc/usr_intf/axis/extensions/emcmodule.cc
emc/usr_intf/halui.cc
emc/usr_intf/xemc.cc
emc/usr_intf/schedrmt.cc
emc/iotask/ioControl.cc
emc/iotask/ioControl_v2.cc
emc/task/emctaskmain.cc
emc/task/taskintf.cc
emc/task/taskclass.cc
emc/task/iotaskintf.cc
```
Out of these, emcsh, emclcd, keystick, emcrsh, emcmodule, halui, xemc and schedrmt are bona-fide UI's, so those would have to migrate in concert; some, like schedrmt, might have to go anyway.
thinking of it - there might be a way to do this piecemeal:
- leave the current input channel in place so the above UI's keep working
- add a parallel input channel based on zeroMQ/NML which is just an alternative way of talking to task; the old UI's would not be affected
- then migrate emcmodule.cc (the 'import linuxcnc' extension) - this would use the new channel exclusively
- this should actually fix the issue for all code which uses 'import linuxcnc' - which is most of the usage anyway
- once that code path is known to be stable, the other UI's could be migrated at leisure, taking clues from emcmodule
see also the list post
I actually did some work addressing the last item but that was never merged. However, it worked as advertised, and mixing zeroMQ transport and NML message contents was not an issue.
Can you point to this as a starter, in your repo or wherever?
these are the branches (for-sascha, and the underlying zmq-submit-task.commands): http://git.mah.priv.at/gitweb?p=emc2-dev.git;a=shortlog;h=refs/heads/for-sascha
this was about using the 'cheesestand ticket' analogy I referred to, so that is just one option
when reading this branch, I suggest concentrating on the diffs of src/emc/task/emctaskmain.cc ("server" side) and src/emc/usr_intf/axis/extensions/emcmodule.cc ("client" side)
I did some writeup on migration back then, but that was about taking too big a bite at one step (including protobuf and whatnot) - take with a big lump of salt
Also since @einstine909 is now on the canon case, we will have several options downstream
But IMO just adding a zeroMQ path in parallel to the NML input channel looks perfectly doable and pretty well isolated
also see the standalone command injection demo
not sure all this still works - ping me before doing anything more than reading, and I'll rebase onto mk first if you want
I suspected you had already 'invented the wheel', just needed a tyre change :smile: Won't be able to look properly for a few days.
emcmodule.cc doesn't do anything for me, but I thought of implementing the server in emctask as a parallel command insertion route and modifying a copy of my C++ access libraries to pump their commands by that route.
If that worked, someone would just need to convert perfectly good C++ code into parsel-tongue bindings, to get the python equivalent :snake:
oh, let me have a look in the evening, I'll clean up and rebase, and review what I actually did (and what actually worked)
it could be just that the standalone injection demo was what worked, and I had not touched emcmodule yet at all
on the parsel-tongue thingie replacing 'import linuxcnc': almost. zeroMQ would be just fine as handled by 'import zmq'. The bummer is: as long as we use NML we need to remain in C++ land. As soon as the message content is encoded in protobuf it's all Python (and downhill from there ;)
@mhaberler Saw your talk yesterday, thanks for all the information!
Avoiding the duplicate serials globally is a good idea. Regarding using ZeroMQ, does that mean milltask is being changed to have something like http://zguide.zeromq.org/page:all#Shared-Queue-DEALER-and-ROUTER-sockets (Figure 16)?
@mhaberler
Might it be possible to write a wrapper around zeromq to make it have the NML interface and just drop it in everywhere? Then migrate code to use zeromq directly as appropriate (where additional functionality is needed)?
Ken