jq icon indicating copy to clipboard operation
jq copied to clipboard

Use codepoint index for indices/1, index/1 and rindex/1

Open wader opened this issue 1 year ago • 6 comments

Previsouly byte index was used.

Fixes #1430, fixes #1624, fixes #3064.

wader avatar Mar 12 '24 09:03 wader

Haven't entirely convinced myself yet that it should be fine to look for matches using the byte representation. Assuming both the needle and haystack is valid utf-8 i'm thinking it should be fine because of utf-8's self-synchronization property.

Update: now looking at jv_string_slice https://github.com/jqlang/jq/blob/c95b34ff827d05a2d262f00280a4891a295ed0ed/src/jv.c#L1374 i'm not sure anymore if one can assume strings are valid utf-8 or is the invalid utf-8 checks not really needed?

wader avatar Mar 12 '24 09:03 wader

I'd like to include this. Any objection on changing the behavior in 1.8?

itchyny avatar Aug 20 '24 12:08 itchyny

Ok to merge for me but would be great if someone could have a look or know if my assumption about strings always being valid utf-8 is true.

wader avatar Aug 24 '24 14:08 wader

@itchyny asked:

Any objection on changing the behavior in 1.8

This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

@wader wrote:

if my assumption about strings always being valid utf-8 is true.

Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.

Perhaps a starting point would be "a\uDD1Ec":

echo '"a\uDD1Ec"' | jaq -r .
Error: failed to parse: invalid character with index 56606

echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
[4,3]

pkoppstein avatar Aug 24 '24 17:08 pkoppstein

@itchyny asked:

Any objection on changing the behavior in 1.8

This is a major breaking change and it has been my understanding for some years that such changes would have to wait until jq 2.0. Certainly if we were following a strict SemVer policy that would be the case. Since we don't seem to be doing so, the situation is not black-and-white, but if the change is incorporated into 1.8, we should be sure to highlight it.

I can't see how the current behaviour for non-ASCII strings makes any sense or could even be useful in any resonable way? so for me it feels more like a bug.

@wader wrote:

if my assumption about strings always being valid utf-8 is true.

Based on past experience, such an assumption would not be warranted, so the question is: could the proposed changes make anything worse? I suppose the major issue would be whether (in the presence of invalid utf-8) the old index would give an accurate byte count but the new version might give an inaccurate codepoint count.

Perhaps a starting point would be "a\uDD1Ec":

echo '"a\uDD1Ec"' | jaq -r .
Error: failed to parse: invalid character with index 56606

echo '"a\uDD1Ec"' | jq -c '[index("c"), length]'
[4,3]

This is an incomplete surrogates pair? yeap stuff like this i'm concerned about also.

wader avatar Aug 24 '24 17:08 wader

With this change:

$ echo '"a\uDD1Ec"' | ./jq -c '[index("c"), length]'
[2,3]

Seems correct assuming broken surrogates codepoints should be allowed. But I think i'm mostly concern if there is any way to produce jq strings that has a byte buffer that is not valid utf-8. If so use of jvp_utf8_decode_length might end up out-of-sync codepoint-wise or pointing outside the byte buffer.

wader avatar Aug 24 '24 20:08 wader

As @nicowilliams also expressed this is a bug https://github.com/jqlang/jq/issues/1430#issuecomment-347628258 ill merge this now

wader avatar Nov 17 '24 09:11 wader

Should the docs be updated to make it clear that it's a codepoint index, instead of byte index?

thaliaarchi avatar Nov 19 '24 08:11 thaliaarchi

Good point, i did quick skimming of the docs and it seems like we don't say it's byte offset anywhere but we don't make it that clear that it's codepoints either. So maybe mention for the index-functions and possible also under "Array/String Slice" and/or "Types and Values"? for the regexp functions we do mention things are in codepoints.

wader avatar Nov 19 '24 09:11 wader