perl5
Hopping chars now consumes 1 hop count regardless of starting position or direction
Previously, that was the case for backwards hops, but if a forward hop started at a continuation byte, each such byte in the current character consumed one hop count.
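The difference can be sketched with a simplified model. This is not the perl5 source: `is_cont()` and `skip_len()` are hypothetical stand-ins for `UTF8_IS_CONTINUATION()` and `UTF8SKIP()`, `hop_forward_old()` and `hop_forward_new()` are illustrative names, and only standard 1 to 4 byte UTF-8 is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION(): bytes of form 10xxxxxx. */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Hypothetical stand-in for UTF8SKIP(): length implied by a start byte.
 * A continuation byte (or any unexpected byte) reports a length of 1. */
static size_t skip_len(unsigned char b)
{
    if (b < 0x80) return 1;              /* invariant (ASCII) */
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;                            /* continuation or invalid start */
}

/* Model of the prior behavior: every step trusts skip_len(), so each
 * continuation byte at the starting position uses up a whole hop. */
static const unsigned char *
hop_forward_old(const unsigned char *s, size_t off, const unsigned char *end)
{
    assert(s <= end);
    while (off-- && s < end) {
        size_t skip = skip_len(*s);
        if ((size_t)(end - s) <= skip) return end;
        s += skip;
    }
    return s;
}

/* Model of the proposed behavior: a start inside a character first
 * synchronizes to the next non-continuation byte, costing one hop. */
static const unsigned char *
hop_forward_new(const unsigned char *s, size_t off, const unsigned char *end)
{
    assert(s <= end);
    if (off && s < end && is_cont(*s)) {
        do { s++; } while (s < end && is_cont(*s));
        off--;
    }
    while (off-- && s < end) {
        size_t skip = skip_len(*s);
        if ((size_t)(end - s) <= skip) return end;
        s += skip;
    }
    return s;
}
```

Under this model, with the bytes `E2 82 AC 'a' 'b'` and a start at the second byte, one old-style hop advances a single byte and stays inside the character, while one new-style hop lands on `'a'`.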
Should the subject of the pull request be "hopping" rather than "hoppiing"?
If I understand, won't this make:
utf8_hop_forward(p, 2, pend)
produce a different result from:
utf8_hop_forward(utf8_hop_forward(p, 1, pend), 1, pend)
?
On 7/10/22 18:55, Tony Cook wrote:
If I understand, won't this make:
utf8_hop_forward(p, 2, pend)
produce a different result from:
utf8_hop_forward(utf8_hop_forward(p, 1, pend), 1, pend)
?
No. There is no change from the current behavior if the starting position is at a non-continuation byte.
Suppose you have two characters: the first is two bytes long, the second is three. A forward hop of 2 moves you five bytes. Done as two hops, the first forward hop of 1 moves you two bytes to the beginning of the second character, and the second forward hop of 1 moves you an additional three bytes.
I was thinking of an invalid string, e.g. INVARIANT CONT CONT.
For the single hop 2 it skips the invariant, then the first CONT, since UTF8SKIP() for a CONT is 1.
For the 2 x hop 1 it skips the invariant on the first call, then both CONTs on the second call.
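The two cases above can be checked against a small model of the proposed forward hop. This is a sketch under stated assumptions, not the perl5 implementation: `is_cont()`, `skip_len()`, `hop_forward()` and `hops_compose()` are hypothetical names, and `skip_len()` only approximates `UTF8SKIP()` for standard 1 to 4 byte UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION(). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Hypothetical stand-in for UTF8SKIP(); a continuation byte reports 1. */
static size_t skip_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Model of the proposed forward hop: resynchronize only when the
 * starting byte is a continuation, then step by skip_len(). */
static const unsigned char *
hop_forward(const unsigned char *s, size_t off, const unsigned char *end)
{
    assert(s <= end);
    if (off && s < end && is_cont(*s)) {
        do { s++; } while (s < end && is_cont(*s));
        off--;
    }
    while (off-- && s < end) {
        size_t skip = skip_len(*s);
        if ((size_t)(end - s) <= skip) return end;
        s += skip;
    }
    return s;
}

/* Returns 1 when one hop of 2 lands where two hops of 1 do. */
static int hops_compose(const unsigned char *buf, size_t len)
{
    const unsigned char *end = buf + len;
    const unsigned char *once = hop_forward(buf, 2, end);
    const unsigned char *twice =
        hop_forward(hop_forward(buf, 1, end), 1, end);
    return once == twice;
}
```

In this model, `hops_compose()` holds for a well-formed two-byte character followed by a three-byte character, but not for `'a' 0x80 0x80`: the single hop of 2 stops on the second continuation byte, while the second of the two single hops starts on a continuation byte and runs to the end.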
Ideally of course, we wouldn't get an invalid string, but these functions are intended to at least be safe on invalid strings.
An alternative would be to throw an exception for invalid strings, including if s starts on a continuation, but we have plenty of other functions that could be used for validation before calling utf8_hop_*().
The only way to avoid surprises is to always check for complete well-formedness.
But my assertion is that the prior behavior is insane for well-formed UTF-8. If you call it in the middle of a character, each continuation byte will count as a full character. The new method would automatically synchronize for you.
I'd rather have insane behavior on illegal input and sane behavior on legal input.
Do we ever call these functions where s isn't one of: a) start, b) end, or c) the result of calling these functions on any of a, b, or c?
I'd tend to think that s being some random pointer into the string is an error in itself, but I don't recall all the circumstances in which we were calling these and the older functions they were intended to replace.
There are no calls outside of APItest to hopping forward where the pointer isn't at the beginning. However, there are several places in the code that do want to hop from the middle to the beginning of the next character, and they roll their own code to do that.
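A roll-your-own loop of the kind described might look roughly like the following. It is a hypothetical illustration, not code taken from the perl5 source; `is_cont()` and `next_char_start()` are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION(). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Advance from somewhere inside a character to the first byte of the
 * next character, never moving past end. A pointer that is already on
 * a non-continuation byte is returned unchanged. */
static const unsigned char *
next_char_start(const unsigned char *s, const unsigned char *end)
{
    assert(s <= end);
    while (s < end && is_cont(*s))
        s++;
    return s;
}
```

With the proposed behavior, a forward hop of 1 from a continuation byte would cover the same ground as this loop.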
I believe there are calls to hop back that don't start at the end or start.
The proposed interface would bring hop forward into parity with hop backward.
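For comparison, a backward hop can be modelled as below. This is a sketch, not the perl5 source: `is_cont()` and `hop_back()` are hypothetical names, and the hop count is taken as a positive number of characters for simplicity, which is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8_IS_CONTINUATION(). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Model of a backward hop: each hop steps back at least one byte and
 * keeps going while it is on a continuation byte, so a start in the
 * middle of a character costs exactly one hop. */
static const unsigned char *
hop_back(const unsigned char *s, size_t off, const unsigned char *start)
{
    assert(s >= start);
    while (off-- && s > start) {
        do { s--; } while (s > start && is_cont(*s));
    }
    return s;
}
```

Starting on the last byte of a three-byte character, one hop in this model lands on that character's first byte, however many continuation bytes lie in between. That one-hop-per-partial-character rule is what the forward direction would adopt.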