code128: Add minimal encodation algorithm (non-extended ASCII only)
Adapted from ZXing (props Alex Geller). It is perhaps 80% slower, depending on the data, and is stack heavy, but it does improve some outcomes when FNC1s are present (GS1 or manual). It doesn't appear to improve much else (the previous algorithm was pretty good).
Prompted by tests added from PR #272, props lyngklip
This is the second of the alternative PRs (PR #275). You choose!
I'm wondering about the first Code 128 test case. I suspect a decoder might add 128 to those '9' digits because extended ASCII is active?
Edit: it seems like a grey area. I have no idea what common practice is. This might be good.
Yes you might think that but extended mode only applies to "the ISO/IEC 646 value", i.e. to Code Sets A and B ASCII values, not to Code Set C double digit values, which aren't ASCII, so C mode stuff can be freely intermixed with extended mode shifts and latches.
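To make the distinction concrete, here is a minimal decoder sketch in Python. It is not BWIPP or ZXing code; `decode`, `ascii_value` and the table names are made up for illustration. It assumes the input is the start character followed by data codewords, with the check and stop characters already stripped, and it simply ignores FNC1 to FNC3. The FNC4 state only ever touches values produced by Code Sets A and B; Code Set C pairs come out as plain digits whatever the extended state is.

```python
START = {103: "A", 104: "B", 105: "C"}
FNC4 = {"A": 101, "B": 100}  # FNC4 has a different value in each code set
LATCH = {
    "A": {99: "C", 100: "B"},
    "B": {99: "C", 101: "A"},
    "C": {100: "B", 101: "A"},
}
SHIFT = 98


def ascii_value(cset, cw):
    """ISO/IEC 646 value of data codeword cw (0-95) in Code Set A or B."""
    if cset == "A" and cw >= 64:
        return cw - 64  # control characters NUL..US
    return cw + 32      # SPACE upwards, in both sets


def decode(codewords):
    """Illustrative only: start character + data codewords -> bytes."""
    cset = START[codewords[0]]
    out = bytearray()
    latched = pending = shifted = False
    for cw in codewords[1:]:
        if cw in LATCH[cset]:
            cset = LATCH[cset][cw]       # code set change, FNC4 state untouched
        elif cset == "C":
            if cw < 100:
                out += b"%02d" % cw      # digit pair, never has 128 added
        elif cw == FNC4[cset]:
            if pending:                  # second FNC4 in a row toggles the latch
                latched, pending = not latched, False
            else:
                pending = True
        elif cw == SHIFT:
            shifted = True               # next data character uses the other set
        elif cw < 96:
            use = ("B" if cset == "A" else "A") if shifted else cset
            value = ascii_value(use, cw)
            if latched != pending:       # a single FNC4 inverts the latched state
                value += 128
            out.append(value)
            pending = shifted = False
    return bytes(out)
```

With Start B, FNC4 FNC4, `A`, Code C, `99`, Code B, `A`, the sketch yields byte 0xC1, then the digits `99`, then 0xC1 again: the latch survives the excursion into Code Set C without affecting the digits.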
ChatGPT seems to agree with you once I helped it understand the question. That makes encoding a bit more complex, right? Interesting. I was under the impression that FNC4 insertion was a "preprocessing" step. This would mean that sequences of extended ASCII that mapped to ASCII digits might be encoded in character set C, and that does seem a bit odd, even though it would not be a problem so long as encoder and decoder agree. But that's what the encoder did before, right?
Well that was a bug, which PR #275 fixes. Extended ASCII should never be encoded as Code Set C digits.
No time to weigh in right now, but I'll likely take this PR (over the other) once I've had time to review.
@gitlost Did the basis for the algorithm get written up anywhere? If we deviate significantly from the informative routine provided in the symbology specs then we should have some reference to signpost users to. (I've had to expend a lot of effort over the years convincing developers of pathological decoders that they need to fix decoding bugs, even if the codeword sequence is not the result of a reference encoder.)
There's no write-up as such; all that can really be said is that it's a standard algorithm, e.g. https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm.
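As a rough illustration of the general shape, here is a Python sketch of a recursion over suffixes with memoization that only counts codewords. It is not the code in this PR or in ZXing; `min_codewords`, `fits` and `cost` are hypothetical names. It assumes non-extended ASCII input, counts a latch and a shift as one codeword each, and ignores the FNC characters entirely.

```python
from functools import lru_cache


def fits(cset, byte):
    """Can this byte be encoded directly in Code Set A or B?"""
    if cset == "A":
        return byte < 96           # controls, digits, upper case
    return 32 <= byte < 128        # Code Set B: printable ASCII and DEL


def min_codewords(data: bytes) -> int:
    """Fewest codewords (start character + data) for non-extended ASCII."""
    if any(b > 127 for b in data):
        raise ValueError("sketch handles non-extended ASCII only")
    n = len(data)

    @lru_cache(maxsize=None)
    def cost(i, cset):
        # Fewest codewords needed for data[i:] when cset is active.
        if i == n:
            return 0
        best = float("inf")
        for nxt in "ABC":
            latch = 0 if nxt == cset else 1
            if nxt == "C":
                pair = data[i:i + 2]
                if len(pair) == 2 and pair.isdigit():
                    best = min(best, latch + 1 + cost(i + 2, "C"))
            elif fits(nxt, data[i]):
                best = min(best, latch + 1 + cost(i + 1, nxt))
        if cset != "C":
            other = "B" if cset == "A" else "A"
            if fits(other, data[i]):   # Shift: one character from the other set
                best = min(best, 2 + cost(i + 1, cset))
        return best

    return 1 + min(cost(0, start) for start in "ABC")
```

The recursion goes one level deeper per input character (or digit pair), which is one reason to keep an eye on stack usage with an approach of this kind.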
I'm concerned about performance, both speed and stack usage, so I'd hesitate to use it without trying it out first in some real-life cases if that's possible.
I wouldn't be confident in the performance checking I did, as it was just loops timed with usertime, with garbage collection turned off (-2 vmreclaim).
Here's the very simplistic performance test I used (mode128 is the Divide-and-Conquer one, code128 is the current):
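% collect garbage once, then disable automatic collection so it doesn't skew the timings
2 vmreclaim
-2 vmreclaim
/tot 0 def
/startt usertime def
% encode the same data 100 times (the parse option expands the ^NNN escapes)
1 1 100 {
(^031^031_^127^159^031^159^159^159^15912345``^255^000^127^255^224^224^159`) (dontdraw parse) /mode128 /uk.co.terryburton.bwipp findresource exec
} for
/endt usertime def
% elapsed usertime in milliseconds
/tot tot endt startt sub add def
(mode128 tot ) print tot ==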
2 vmreclaim
-2 vmreclaim
/tot 0 def
/startt usertime def
1 1 100 {
(^031^031_^127^159^031^159^159^159^15912345``^255^000^127^255^224^224^159`) (dontdraw parse) /code128 /uk.co.terryburton.bwipp findresource exec
} for
/endt usertime def
/tot tot endt startt sub add def
(code128 tot ) print tot ==
2 vmreclaim
-2 vmreclaim
/tot 0 def
/startt usertime def
1 1 100 {
(^031^031_^127^159^031^159^159^159^15912345``^255^000^127^255^224^224^159`) (dontdraw parse) /mode128 /uk.co.terryburton.bwipp findresource exec
} for
/endt usertime def
/tot tot endt startt sub add def
(mode128 tot ) print tot ==
2 vmreclaim
-2 vmreclaim
/tot 0 def
/startt usertime def
1 1 100 {
(^031^031_^127^159^031^159^159^159^15912345``^255^000^127^255^224^224^159`) (dontdraw parse) /code128 /uk.co.terryburton.bwipp findresource exec
} for
/endt usertime def
/tot tot endt startt sub add def
(code128 tot ) print tot ==
Something that I've been thinking about: should it be made explicit what the encoder does when it can encode part of the message in two different ways of the same length? For example:
- prefer ASCII > Extended or vice versa
- prefer A>B>C or B>A>C or C>A>B etc.
- prefer range shift to range switch or...
- prefer character set shift to character set switch or...
- prefer dangling digit at the front of an odd digit span or at the end
- switch from ASCII A to Extended B using 100 100 100 or 101 101 100 etc.
There are possibly more alternatives than the ones I have listed. The reason I ask is that I've been playing with a somewhat rewritten encoder and I have come up with something where I can sort of control the priority of these things, but I still can't match all the test cases. The test cases in some places seem to prefer range switching over shifting and in other places the other way around. Unfortunately I have no insight into specifications.
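One way to make such priorities explicit is to compare candidates by length first and then by an ordered list of things to avoid. The Python sketch below is only an illustration of that idea, not how either encoder works; `choose`, `avoid_order` and the step labels are hypothetical.

```python
def choose(candidates, avoid_order):
    """Pick the shortest candidate; break ties by the ordered 'avoid' list.

    Each candidate is a list of step labels such as "Start B", "Shift A",
    "Code A" or a data character name. avoid_order lists label prefixes,
    most important first; fewer steps with that prefix is better.
    """
    def key(steps):
        counts = tuple(
            sum(step.startswith(prefix) for step in steps)
            for prefix in avoid_order
        )
        return (len(steps),) + counts

    return min(candidates, key=key)


# Two encodings of "a", NUL, NUL, "a" with the same length (7 codewords).
shifty = ["Start B", "a", "Shift A", "NUL", "Shift A", "NUL", "a"]
latchy = ["Start B", "a", "Code A", "NUL", "NUL", "Code B", "a"]

prefer_latch = choose([shifty, latchy], ("Shift", "Code"))
prefer_shift = choose([shifty, latchy], ("Code", "Shift"))
```

Here `prefer_latch` is the latching encoding and `prefer_shift` is the shifting one; a strictly shorter candidate always wins regardless of the order given.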
> Unfortunately I have no insight into specifications.
In theory, the examples from the initial ISO/IEC 15417:2000 specification were based on this code
However I have not verified it.
Closing in favour of PR #278