stdlib More string trimming capabilities

trafficstars

The existing string.trim, string.trim_left and string.trim_right functions currently only trim whitespace.

Could we add equivalent trimming functions that take the codepoints to be trimmed as an input, similar to Erlang's string.trim/3?

E.g.

import gleam/list

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Both, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_left(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Leading, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_right(
  s: String,
  codepoints: List(UtfCodepoint),
) -> String {
  string_trim(s, Trailing, codepoints_to_ints(codepoints))
}

fn codepoints_to_ints(codepoints: List(UtfCodepoint)) {
  codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(a: String, b: Direction, c: List(Int)) -> String

There are other ways to slice this, such as taking a String as the final parameter and using its codepoints as the ones to trim, rather than a List(UtfCodepoint).

I haven't looked into what the JavaScript implementation would be, but can do so.

On a related note, my understanding is that the existing methods with the words "left" and "right" in them assume left-to-right reading order, which doesn't hold for all languages (e.g. Arabic, Hebrew). Could consider using "start" and "end" instead, or adding these as aliases.

May 11 '24 03:05 richard-viney

Sounds good, but we'll need to think of some other name as the ones used in the code example don't fit with the convention used in this repo.

Changing left and right for start and end sounds good.

May 13 '24 13:05 lpil

What names would you suggest?

Exposing the existing Direction type would mean only one public function is added:


@target(erlang)
pub fn trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: String,
) -> String {
  let codepoint_ints =
    codepoints
    |> to_utf_codepoints
    |> list.map(utf_codepoint_to_int)

  do_trim_codepoints(string, direction, codepoint_ints)
}

@target(erlang)
@external(erlang, "string", "trim")
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String

@target(javascript)
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String {
  todo
}

May 20 '24 01:05 richard-viney

We will not expose the direction type, it needs to be different functions.

May 20 '24 13:05 lpil

maybe one of these

pub fn trim_codepoints(
  string: String
) { todo }


// and any pair:


pub fn trim_start_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_end_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_head_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_tail_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_beginning_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_ending_codepoints(
  string: String,
  by: String,
) { todo }

Jul 18 '24 19:07 inoas

Has anyone started working on this?

I'm new to gleam but I'd like to try implementing this.

pub fn trim_codepoints_start(str: String, by: String) -> String {
  case starts_with(str, by) {
    True -> drop_left(str, length(by)) |> trim_codepoints_start(by)
    False -> str
  }
}

The above implementation seems to work, but I'm pretty sure there's a more performant way to do this, I'm just not used to thinking in pure functional terms.

Aug 31 '24 16:08 RobiFerentz

Yep I'd still like to get this through. We need to:

Agree on the naming for the functions.
Do the JavaScript implementation. I'm happy to take a look at this.

The code below implements the three proposed trim functions for the Erlang target via string:trim/3. JavaScript implementation still pending as mentioned above.

import gleam/list
import gleam/string

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Both, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_start(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Leading, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_end(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Trailing, codepoint_ints)
}

fn string_to_codepoint_ints(string: String) -> List(Int) {
  string
  |> string.to_utf_codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(string: String, dir: Direction, characters: List(Int)) -> String

Sep 01 '24 06:09 richard-viney

Are we sure we want to be trimming codepoints and not graphemes? How would we ensure that the string is still valid unicode when removing codepoints?

Sep 01 '24 12:09 lpil

Looking at a few other languages, Erlang's string:trim/3 and Rust's string.trim_end_matches both operate on codepoints, i.e. they allow invalid graphemes to be created by the trimming operation. JavaScript doesn't have an equivalent configurable trim, but if you replace a specific codepoint with "" using a regex you get the same end result, i.e. regex replace doesn't respect graphemes.

I had previously thought that Gleam strings were defined to be valid UTF-8 binaries, but hadn't realised this extended to also being valid Unicode graphemes. Is that the case? If so, wouldn't the following lines need to error because they create a string with an invalid grapheme? (U+0301 is the 'Combining Acute Accent' as it modifies the preceding character):

let invalid_grapheme = "\u{0301}"
let invalid_grapheme = <<0xCC, 0x81>> |> bit_array.to_string

I tried a couple of other (what I believe to be) invalid graphemes and none were rejected from being turned into Gleam strings.

Either way, this trimming functionality takes a List(String) of graphemes to be trimmed off the input string, but if String isn't guaranteed to only contain valid graphemes then the result of the trim can't strictly be guaranteed to either.

This would look something like:

/// Removes all occurrences of the specified graphemes from the start and
/// end of a `String`.
///
pub fn trim_graphemes(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the start of a
/// `String`.
///
pub fn trim_graphemes_start(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the end of a
/// `String`.
///
pub fn trim_graphemes_end(string: String, graphemes: List(String)) -> String {
  todo
}

Sep 08 '24 02:09 richard-viney

I don't really understand what the best behaviour is here. Is there a use case that motivates making it easy to have invalid graphemes?

Sep 10 '24 15:09 lpil

No specific use case that I'm aware of. In practice most trimming is probably not done using codepoints that can be part of multi-codepoint graphemes, hence why other languages tend to trim just based on codepoints.

Maybe it's better to make the graphemes argument a String instead of a List(String):

pub fn trim_graphemes(string: String, graphemes: String) -> String
pub fn trim_graphemes_start(string: String, graphemes: String) -> String
pub fn trim_graphemes_end(string: String, graphemes: String) -> String

The implementation then calls string.to_graphemes() on the graphemes argument and trims using that.

Sep 11 '24 02:09 richard-viney

Some alternative names:

pub fn trim_with(string: String, graphemes: String) -> String
pub fn trim_start_with(string: String, graphemes: String) -> String
pub fn trim_end_with(string: String, graphemes: String) -> String

pub fn trim_using(string: String, graphemes: String) -> String
pub fn trim_start_using(string: String, graphemes: String) -> String
pub fn trim_end_using(string: String, graphemes: String) -> String

Or could copy the Rust naming:

pub fn trim_matches(string: String, graphemes: String) -> String
pub fn trim_start_matches(string: String, graphemes: String) -> String
pub fn trim_end_matches(string: String, graphemes: String) -> String

Sep 11 '24 04:09 richard-viney

I've been poking at this, and just to throw a name into the ring, I've been liking trim_chars; I feel like trim_matches implies it takes a regexp, but that's just me. I'm definitely flexible on the name.

I assume we'd want to deprecate trim_left and trim_right and create aliases for consistency?

Oct 06 '24 16:10 ollien

Thanks for looking into this.

I don't know if introducing the term 'char' is desirable when 'codepoint' and 'grapheme' are already in use and are well-defined?

Oct 06 '24 22:10 richard-viney

Yeah that's reasonable. I think I'll take with, if that's ok with you!

Oct 06 '24 23:10 ollien

I had posted a PR before having read this issue thoroughly (I had been working on it before I found the issue actually), so I missed a couple things here! Apologies for not addressing the concerns more thoroughly, first.

I think trimming on graphemes is fine, but perhaps a bit of a challenge. I suspect the erlang implementation will not respect this (not at a PC to check). Whether or not that's worth the complexity of having to re-implement the behavior in the stdlib, I'm not sure. It definitely feels like a nice behavior, though.
I propose we take on the name trim_start_with / trim_end_with. IMO it's the clearest of the suggested names so far, but I don't care too much necessarily
It's been suggested implicitly, but I'd like to propose more formally that these two functions are shipped with a corresponding trim_with, that handles bidirectional trimming. It seems odd to not include it, but to include the left/right variants.
This, I feel less strongly about, but to echo my question above, we might consider deprecating trim_left/trim_right, and rename them to trim_start/trim_end` for consistency. Not sure how y'all feel about this though.

Oct 07 '24 19:10 ollien

Those names are not acceptable in the stdlib as that's not the convention it uses.

We will be deprecating trim left and right.

Oct 07 '24 20:10 lpil

@lpil is there a name you'd prefer? I'm happy to use whatever name you think is conventional.

Oct 07 '24 20:10 ollien

I do not have any firm ideas for a name, but I am also not convinced that these functions are worthwhile additions.

Oct 07 '24 20:10 lpil

I'll admit, it's definitely more thorny than I thought. Originally I assumed this was so uncontroversial I'd take it upon myself to implement 😅. Text encoding is always hairy.

I do think that this is a common enough need to warrant its inclusion in the standard library. I can't say I use this functionality commonly, but it is very helpful when I need it (usually some kind of parsing case).

All of that said, I'm happy to follow your lead on this. If there's something you need to help move this forward, I'm happy to do what I can. I also understand the burden of needing to maintain what outside contributors submit, so I would also understand if this juice wasn't worth the squeeze.

I'll think on names, for now; or maybe someone else has one that will stick.

Oct 07 '24 20:10 ollien

stdlib stdlib copied to clipboard

More string trimming capabilities

stdlib
stdlib copied to clipboard