
hopping chars now consumes 1 hop count regardless of the starting position or direction

Open khwilliamson opened this issue 2 years ago • 7 comments

Previously, that was the case for backwards hops, but if a forward hop started at a continuation byte, each such byte in the current character consumed one hop count.
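To make the difference concrete, here is a minimal sketch contrasting the two forward-hop behaviors described above. It is a simplified model, not the Perl core implementation: `is_cont`, `skip_len`, `hop_forward_old`, and `hop_forward_new` are hypothetical names, and `skip_len` handles only standard 1 to 4 byte UTF-8 rather than everything Perl's `UTF8SKIP()` covers.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char u8;

/* Continuation bytes have the bit pattern 10xxxxxx. */
static int is_cont(u8 b) { return (b & 0xC0) == 0x80; }

/* Length implied by the byte under the pointer (simplified assumption:
   standard UTF-8 only; a stray continuation byte counts as length 1). */
static size_t skip_len(u8 b) {
    if (b < 0xC0) return 1;   /* ASCII or continuation */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Model of the prior behavior: every step advances by skip_len(), so a
   continuation byte under p consumes a whole hop by itself. */
static const u8 *hop_forward_old(const u8 *p, size_t hops, const u8 *end) {
    assert(p <= end);
    while (hops > 0 && p < end) {
        size_t n = skip_len(*p);
        if ((size_t)(end - p) < n) return end;  /* truncated character */
        p += n;
        hops--;
    }
    return p;
}

/* Model of the proposed behavior: finishing the current partial character
   costs exactly one hop, however many continuation bytes remain. */
static const u8 *hop_forward_new(const u8 *p, size_t hops, const u8 *end) {
    assert(p <= end);
    if (hops > 0 && p < end && is_cont(*p)) {
        while (p < end && is_cont(*p)) p++;
        hops--;
    }
    while (hops > 0 && p < end) {
        size_t n = skip_len(*p);
        if ((size_t)(end - p) < n) return end;  /* truncated character */
        p += n;
        hops--;
    }
    return p;
}
```

With the bytes `E2 82 AC 'A' 'B'` and a start pointer on the first continuation byte, the old model needs two hops to reach `'A'`, while the new model needs one. From a character start the two models agree.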

khwilliamson avatar Jul 10 '22 16:07 khwilliamson

Should the subject of the pull request be "hopping" rather than "hoppiing"?

jkeenan avatar Jul 10 '22 18:07 jkeenan

If I understand, won't this make:

utf8_hop_forward(p, 2, pend)

produce a different result from:

utf8_hop_forward(utf8_hop_forward(p, 1, pend), 1, pend)

?

tonycoz avatar Jul 11 '22 00:07 tonycoz

On 7/10/22 18:55, Tony Cook wrote:

If I understand, won't this make:

utf8_hop_forward(p, 2, pend)

produce a different result from:

utf8_hop_forward(utf8_hop_forward(p, 1, pend), 1, pend)

?

No. There is no change from current behavior if the starting position is at a non-continuation.

If you have two characters, say the first is two bytes long and the second is three, a forward hop of 2 moves you five bytes. The first forward hop of 1 moves you two bytes, to the beginning of the second character; the second forward hop of 1 moves you an additional three bytes.
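The arithmetic in that example can be checked with a small sketch. This is a simplified stand-in for the real function, assuming standard UTF-8 lengths only; `skip_len` and `hop_forward` are hypothetical names, and the start pointer is always at the beginning of a character, which is the case where nothing changes.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char u8;

/* Length implied by a lead byte (simplified: standard UTF-8 only). */
static size_t skip_len(u8 b) {
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Hop forward `hops` characters from a character start, never past end. */
static const u8 *hop_forward(const u8 *p, size_t hops, const u8 *end) {
    assert(p <= end);
    while (hops > 0 && p < end) {
        size_t n = skip_len(*p);
        if ((size_t)(end - p) < n) return end;  /* truncated character */
        p += n;
        hops--;
    }
    return p;
}
```

For the bytes `C3 A9 E2 82 AC` (a two-byte character followed by a three-byte one), `hop_forward(s, 2, end)` lands five bytes in, and so does hopping by 1 twice: the first call moves two bytes and the second moves three.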

khwilliamson avatar Jul 11 '22 01:07 khwilliamson

I was thinking of an invalid string, e.g. INVARIANT CONT CONT.

For the single hop of 2, it skips the invariant, then the first CONT, since UTF8SKIP() for a CONT is 1.

For the 2 x hop of 1, it skips the invariant on the first call, then both CONTs on the second call.
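The case can be reproduced with a small model of the proposed semantics. This is a sketch under assumptions, not the code from the pull request: `is_cont`, `skip_len`, and `hop_forward_new` are hypothetical names, and `skip_len` returns 1 for a continuation byte to mirror the statement above about UTF8SKIP().

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char u8;

/* Continuation bytes have the bit pattern 10xxxxxx. */
static int is_cont(u8 b) { return (b & 0xC0) == 0x80; }

/* Simplified length lookup; a continuation byte counts as length 1. */
static size_t skip_len(u8 b) {
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Proposed semantics as modeled here: a start on a continuation byte
   skips every following continuation byte for the price of one hop;
   after that, each hop advances by skip_len(). */
static const u8 *hop_forward_new(const u8 *p, size_t hops, const u8 *end) {
    assert(p <= end);
    if (hops > 0 && p < end && is_cont(*p)) {
        while (p < end && is_cont(*p)) p++;
        hops--;
    }
    while (hops > 0 && p < end) {
        size_t n = skip_len(*p);
        if ((size_t)(end - p) < n) return end;  /* truncated character */
        p += n;
        hops--;
    }
    return p;
}
```

On the bytes `'A' 80 80`, a single hop of 2 from the start stops at offset 2 (on the second CONT), while two hops of 1 stop at offset 3 (the end), so the two call patterns disagree on this malformed input.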

Ideally of course, we wouldn't get an invalid string, but these functions are intended to at least be safe on invalid strings.

An alternative would be to throw an exception for invalid strings, including if s starts on a continuation, but we have plenty of other functions that could be used for validation before calling utf8_hop_*().

tonycoz avatar Jul 13 '22 00:07 tonycoz

The only way to avoid surprises is to always check for complete well-formedness.

But my assertion is that the prior behavior is insane for well-formed UTF-8. If you call it in the middle of a character, each continuation byte will count as a full character. The new method would automatically synchronize for you.

I'd rather have insane behavior on illegal input and sane behavior on legal input.

khwilliamson avatar Jul 13 '22 04:07 khwilliamson

Do we ever call these functions where s isn't one of: a) start, b) end, or c) the result of calling these functions on any of a, b, or c?

I'd tend to think that s being some random pointer in the string is an error in itself, but I don't recall all the circumstances in which we were calling these and the older functions they were intended to replace.

tonycoz avatar Jul 13 '22 06:07 tonycoz

There are no calls outside of APItest to hopping forward where the pointer isn't at the beginning. However, there are several places in the code that do want to hop from the middle to the beginning of the next character, and they roll their own code to do that.
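Such hand-rolled code typically amounts to a loop like the following. This is an illustrative sketch only, not copied from any call site in the Perl source; `is_cont` and `next_char_start` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char u8;

/* Continuation bytes have the bit pattern 10xxxxxx. */
static int is_cont(u8 b) { return (b & 0xC0) == 0x80; }

/* Advance past any continuation bytes so the result points at the start
   of a character (or at end). A pointer already at a character start is
   returned unchanged. */
static const u8 *next_char_start(const u8 *p, const u8 *end) {
    assert(p <= end);
    while (p < end && is_cont(*p)) p++;
    return p;
}
```

Under the proposed interface, a forward hop of 1 from the middle of a character would give the same result as this loop, which is what would let those call sites drop their own copies.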

I believe there are calls to hop back that don't start at the end or start.

The proposed interface would bring hop forward into parity with hop backward.

khwilliamson avatar Jul 16 '22 14:07 khwilliamson