ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(strings): `len_bytes` and `length` for string columns

Open gforsyth opened this issue 1 year ago • 6 comments

As turned up in #10020, we get variable answers back from backends regarding the length of strings, once those strings aren't just simple ASCII.

This is due to the length of a string being sort of a fuzzy concept in the age of Unicode.

I think we should try to standardize on:

  • length returns the count of the number of characters in the string, so e.g. len(🐍) == 1 and len(É) == 1

This may not be possible on all backends, but I think we can get close, then add a new method

  • len_bytes (I'm open to naming suggestions) that returns the length in bytes of the string, len_bytes(🐍) == 4 and len(É) == 2

An incomplete list of functions we can use on various backends for character length:

  • Oracle: LENGTHC
  • MySQL: CHAR_LENGTH
  • Clickhouse: lengthUTF8
  • Impala: CHAR_LENGTH (maybe...)

gforsyth avatar Sep 05 '24 18:09 gforsyth

Here's a table of how our backends handle computing the length of the string 🤦‍♂️🤦‍♀️. It will surprise no one that not every backend we support agrees on this string's length.

TL; DR: consensus seems to be to count the number of code points

The code points are:

>>> s = '🤦‍♂️🤦‍♀️'
>>> code_points = [c.encode("unicode_escape") for c in s]
>>> print('\n'.join(map(bytes.decode, code_points)))
\U0001f926  # person facepalming
\u200d      # zero-width joining character
\u2642      # male symbol
\ufe0f      # variation selector 16
\U0001f926  # person facepalming
\u200d      # zero-width joining character
\u2640      # female symbol
\ufe0f      # variation selector 16
>>> len(code_points)
8
backend function len('🤦‍♂️🤦‍♀️') notes
bigquery CHAR_LENGTH 8
clickhouse lengthUTF8 8
datafusion LENGTH 8
druid LENGTH 10 probably related to UTF-16
duckdb LENGTH 8
exasol LENGTH 8
flink CHAR_LENGTH 8
impala CHAR_LENGTH 26 this is the byte length
mssql LEN 6 no idea
mysql CHAR_LENGTH 14 byte length of non zero width chars?
oracle LENGTHC 6 no idea
polars col.str.len_chars() 8
postgres CHAR_LENGTH 8
pyspark length 8
risingwave CHAR_LENGTH 8
snowflake LENGTH 8
sqlite LENGTH 8
trino LENGTH 8

cpcloud avatar Sep 06 '24 11:09 cpcloud

The upshot here is that we can't make any strong guarantees about what is returned across backends for string operations that depend on a notion of "character", even when encoding is held constant.

This is why we can't have nice strings.

cpcloud avatar Sep 06 '24 18:09 cpcloud

:chefs_kiss:

gforsyth avatar Sep 06 '24 19:09 gforsyth

Yeah, no strong guarantees, but we can get the existing ones closer to each other, and then have len_bytes available if someone is specifically looking for that

gforsyth avatar Sep 06 '24 19:09 gforsyth

Yep, sounds good to me. Slight preference for byte_length over len_bytes or length_bytes, OR deprecating length in favor of len and adding len_bytes or byte_len.

cpcloud avatar Sep 06 '24 19:09 cpcloud

Actually, I hate byte_len, so scratch that one from my suggestion list.

cpcloud avatar Sep 06 '24 19:09 cpcloud