wren Add String.lower

Add String.lower

Open RobLoach opened this issue 3 years ago • 36 comments

I've found I've been missing a few string methods, so would like to clean it up to give Strings a bit more power. This is a small start, with .lower, matching JavaScript's and Java's toLowerCase() methods.

This .lower method returns a lower-case representation of the String. Would like to add .upper, but wanted to get everyone's thoughts on this first.

"AbC dEf GhI".lower // => "abc def ghi"

May 16 '21 16:05 RobLoach

There has been a lot of discussion on this and proper handling of UTF-8 but if we say "we just do what C does" as a first-step I don't think that would be entirely terrible (since many other languages only deal with ASCII also).

May 16 '21 16:05 joshgoebel

This patch is not ideal in a few ways:

the const casting to alter the content invalidate equality of the string, because the hash is created at the creation of the string.
tolower is subject to locale.
tolower operate byte array, while String are expected to be utf-8 encoded.
because utf-8 is assumed there is no guarantee that the final string has the same size than the original.

May 16 '21 17:05 mhermier

@RobLoach original signature was correct and followed wren style.

May 16 '21 17:05 mhermier

Thanks! Took on a few things...

Moved to wren code to make use of the utf8 functional codepoints
Renamed to .lowerCase to look more the other wren style functions -- Should this be .lower to match more Lua style?
Made a note in the docs that it acts on only ASCII characters

May 16 '21 19:05 RobLoach

While this is technically correct, because of the unfinished debate about utf-8, this should technically not be merge even with the limitation in the documentation.

May 16 '21 19:05 mhermier

Should this be .lower to match more Lua style?

That would be my preference. Python also uses lower() but no need for us to use the parentheses.

May 16 '21 20:05 PureFox48

@mhermier

While this is technically correct, because of the unfinished debate about utf-8, this should technically not be merge even with the limitation in the documentation.

If you're thinking here about #948, then that was really about allowable characters in identifiers and the discussion there seemed to be drifting towards maintaining the ASCII-only status quo as most of the non-English native speakers seemed comfortable with that.

I can't in any case see why we shouldn't merge this PR now as it could always be extended later if it were decided to go with some (probably table-based) approach to sort out what were lower and upper case letters within the Unicode character set as a whole.

May 17 '21 08:05 PureFox48

@PureFox48 At the end #948, talks about utf-8 and what is an upper and lower case for the vm. While I agree this could go as this, it would create a fracture in the notion of string casing. Even when documented to only work with ASCII, it would create a precedence of what is lower/upper and will be relied on.

May 17 '21 14:05 mhermier

I don't agree about ASCII. The world is Unicode, now, and although ASCII lowercase can be useful, it is much less useful than Unicode. There is a reason why most languages use Unicode in that. If this adds too much bloat, this can be thing to leave to a library.

May 17 '21 21:05 ChayimFriedman2

The world is Unicode, now,

Even if this is conceded... I'd be in favor of continuous improvement vs postponed perfection. If we agree upcase and downcase are good things to have in core and we have well-tested implementations that start with ASCII-first, then lets add those as they have a lot of utility for a lot of folks. When a well-tested PR is submitted that handles UTF-8 upcase/downcase then that can be merged at a later time.

May 17 '21 22:05 joshgoebel

Unfortunately, migrating ASCII to Unicode in this case is a breaking change. However, if we declare this from the beginning... That's probably possible.

May 17 '21 22:05 ChayimFriedman2

Unfortunately, migrating ASCII to Unicode in this case is a breaking change.

I'm really hoping someone will clear up what the official policy is on breaking changes. I feel like I've only seen people offer their opinions. To me [we can never break anything] seems a very ill-advised and harmful policy for a young project. It prevents us from moving forward in smaller steps and leads to a "postponed perfect" vs "continuous improvement" mindset. This issue is just one example of that happening.

Perhaps we aren't using semantic versioning, but my understanding is that prior to version 1.0 things can freely break... and if there are any guarantees at all it would be only against minor releases... ie 0.4 and 0.4.1 might be compatible... 0.4 and 0.5 would not need to be.

May 17 '21 23:05 joshgoebel

Unfortunately, migrating ASCII to Unicode in this case is a breaking change.

While I can understand how you could interpret that as a breaking change, I would propose that enabling UTF-8 support to .lower is an added feature. This uses Wren's codePoints, which already has support for both.

May 17 '21 23:05 RobLoach

It would be both an added feature and a breaking change for anyone depending on the initial implementations "only alters ASCII character" behavior...

May 18 '21 00:05 joshgoebel

Technically true, but how many people would use .lower and hope that it doesn't lowercase UTF-8? :upside_down_face: .... We're edge-casing at this point.

May 18 '21 00:05 RobLoach

Found a potential list for UTF-8 characters https://pastebin.com/fuw4Uizk

May 18 '21 00:05 RobLoach

Technically true, but how many people would use .lower and hope that it doesn't lowercase UTF-8?

With breaking changes one should be technical. I'm of course in favor of this improvement (even as it stands). My point was that while it is most definitely is a breaking change [in behavior], it's quite likely not a very impactful one, and that we're not even 1.0 yet - so we should be open to breaking changes if they push the project forward.

May 18 '21 00:05 joshgoebel

Found a potential list for UTF-8 characters

I think that table, which lists 1,482 letters, must be out of date as it was submitted in 2013 and Unicode has moved on quite a bit since then.

According to Wikipedia, as of Unicode 13.0, there were 1,791 upper case and 2,155 lower case letters so, assuming nearly all of the former have a lower case equivalent, it looks like we're up above 1,700 now.

May 18 '21 09:05 PureFox48

I've now found the official Unicode case charts together with some explanatory notes.

Unfortunately, the charts aren't in an ideal form for us to easily create a mapping table and there are also a few which aren't 1 to 1 which doesn't help either.

May 18 '21 10:05 PureFox48

This is too much for us to maintain for now. The easiest solution is a configuration hook and a best effort fallback with ascii-7 support and no promise on the future on the best effort.

May 18 '21 11:05 mhermier

Well, we could certainly do ISO-8859-1(5) and we might be able to go a little further than that without using a table.

But I think trying to provide complete Unicode coverage in a small, simple language such as Wren is impractical given the complexities involved and we should instead concentrate on solutions which are relatively easy to implement and will benefit the most people.

May 18 '21 11:05 PureFox48

That's a reasonable compromise. https://cs.stanford.edu/people/miles/iso8859.html

May 18 '21 13:05 RobLoach

Well the minimal requirement for the notion of casing is ascii-7 because of the capital letter naming rule. Guaranteeing more is just superfluous bonus. So even if in implementation we might do ISO-8859-1(5) I think the doc should only state about ascii-7 without extra guarantee than it might work, or use hook to provide your casing functions.

May 18 '21 13:05 mhermier

Identifiers are one thing and text handling is another.

Even if we stay with ASCII-7 for the former (because Wren uses English keywords), I don't see why text handling functions shouldn't push the boat out a little and encompass some non-English letters as well.

May 18 '21 15:05 PureFox48

Identifiers already have UTF-8 requests, so identifiers are already on the same basis as text.

The thing is that by rising the minimum guarantee, you are increasing code that is required. By letting the guarantee to a minimum, if we determine that everyone use some utf-8 library via hooks, we can drop the extra support almost silently.

May 18 '21 16:05 mhermier

Yes, but this PR is being done using Wren code and all it would take to support the basic IS0-8859-1 character set (code-points up to 255) is this:

lower {
    var output = ""
    for (c in codePoints) {
      if ((c >= 65 && c <= 90) || (c >= 192 && c <= 214) || (c >= 216 && c <= 222)) {
        c = c + 32
      }
      output = output + String.fromCodePoint(c)
    }
    return output
}

So it's not as if much extra code is needed and that might be good enough for a lot of people without needing to worry about plugging in a UTF-8 case table at all.

May 18 '21 18:05 PureFox48

It does not bring anything that an external function could do. And it makes precedence on the notion of casing. So it is still not good. It should call a builtin function that does the codePoint lower conversion, that the vm can hook to call an external lib i18n lib to provide complete lower casing of the value.

May 18 '21 19:05 mhermier

that the vm can hook to call an external lib i18n lib

Is there something special about string processing that warrants a new hook? (perhaps it is just that special, I'm not sure) I just don't feel we should get in the habit of adding hooks for every little thing... it's easy enough for someone to implement Unicode.lower and Unicode.upper with the existing C API... I'm not sure the existing String library should change it's behavior depending on a user provided hook function.

It feels to me like we should either:

Fully support Unicode upcase/downcase in Core
Don't support Unicode upcase/downcase in Core, it's a user/library concern
Realizing this is a general problem where it would be useful to extend core classes - and tackle that problem separately.

If someone wants custom behavior for String (be it upcase or downcase or other language specific quirks) perhaps they should just be able to subclass it.

May 18 '21 19:05 joshgoebel

Fully support Unicode upcase/downcase in Core

For ease of use, this means having a reference utf-8 library. Not a big deal, but can have some issue because of licensing and conflicts with users needing a duplicate version of the same library at another version...

Don't support Unicode upcase/downcase in Core, it's a user/library concern

From a maintenance stand point it is basically applying functional programming. It would restrict the linking of the external library to the module, which would probably make the things more manageable.

Realizing this is a general problem where it would be useful to extend core classes - and tackle that problem separately.

This would solve a lot of problems, but would require a ByteArray/String or String/Utf8String couple so I don't know how impact-full it could be.

I don't know the correct answer, but I don't think doing the way it is currently done in this PR is the the way to go at all. Being simple is not an excuse/justification for the adoption in core to me. It should be either not doable by other means, or full-featured (bring some value to its maximum).

May 19 '21 06:05 mhermier

Don't support Unicode upcase/downcase in Core, it's a user/library concern

Realizing this is a general problem where it would be useful to extend core classes - and tackle that problem separately.

My vote would be taking these two strategies. We recognize utf8 is important for Wren, but we want to keep Wren's core small and quick. 99% of the time when using upper/lower, the user would want what is provided here. Adding more would add a level of complexity that I think is best kept outside of Wren's core.

Being able to override String's .lower function in a Utf8 Wren module would be awesome, but in Wren's current state, you can not override internal classes right now. Should likely be taken on in a new issue. In the mean time, using an external Utf8.lower() method would work if you need it to go beyond String.lower.

May 19 '21 13:05 RobLoach

wren wren copied to clipboard

Add String.lower

wren
wren copied to clipboard