feat(strings): `len_bytes` and `length` for string columns
As turned up in #10020, we get varying answers back from backends regarding the length of a string once that string isn't just simple ASCII.
This is due to the length of a string being sort of a fuzzy concept in the age of Unicode.
I think we should try to standardize on:
`length` returns the number of characters in the string, so e.g. `len('🐍') == 1` and `len('É') == 1`

This may not be possible on all backends, but I think we can get close, then add a new method

`len_bytes` (I'm open to naming suggestions) that returns the length in bytes of the string: `len_bytes('🐍') == 4` and `len_bytes('É') == 2`
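For reference, here's a minimal sketch of the proposed semantics in plain Python, using `str` for character (code point) length and UTF-8 `bytes` for byte length. Note that `len_bytes` doesn't exist yet; this just shows the reference behavior, assuming UTF-8 as the byte encoding:

```python
# Character count vs. byte count for the examples above (assumes UTF-8):
for s in ("🐍", "É"):
    print(s, len(s), len(s.encode("utf-8")))
# 🐍 1 4
# É 1 2
```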
An incomplete list of functions we can use on various backends for character length:

- Oracle: `LENGTHC`
- MySQL: `CHAR_LENGTH`
- ClickHouse: `lengthUTF8`
- Impala: `CHAR_LENGTH` (maybe...)
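For what it's worth, each backend's behavior can be probed directly from ibis. A hypothetical probe session against the DuckDB backend (any `ibis.<backend>.connect()` works the same way):

```python
import ibis

# Probe a backend's notion of string "length" by executing a literal expression
con = ibis.duckdb.connect()  # in-memory DuckDB
expr = ibis.literal("🤦‍♂️🤦‍♀️").length()
print(con.execute(expr))  # DuckDB's LENGTH counts code points -> 8
```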
Here's a table of how our backends handle computing the length of the string 🤦‍♂️🤦‍♀️. It will surprise no one that not every backend we support agrees on this string's length.

TL;DR: consensus seems to be to count the number of code points.
The code points are:
>>> s = '🤦‍♂️🤦‍♀️'
>>> code_points = [c.encode("unicode_escape") for c in s]
>>> print('\n'.join(map(bytes.decode, code_points)))
\U0001f926 # person facepalming
\u200d # zero-width joining character
\u2642 # male symbol
\ufe0f # variation selector 16
\U0001f926 # person facepalming
\u200d # zero-width joining character
\u2640 # female symbol
\ufe0f # variation selector 16
>>> len(code_points)
8
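A quick plain-Python sketch that reproduces three of the counts in the table below: code points, UTF-8 bytes, and UTF-16 code units. The Impala and Druid correspondences are the ones already noted in the table:

```python
s = "🤦‍♂️🤦‍♀️"  # ZWJ sequences intact
print(len(s))                            # 8  -> code points (the consensus)
print(len(s.encode("utf-8")))            # 26 -> UTF-8 bytes (matches Impala)
print(len(s.encode("utf-16-le")) // 2)   # 10 -> UTF-16 code units (matches Druid)
```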
| backend | function | len('🤦‍♂️🤦‍♀️') | notes |
|---|---|---|---|
| bigquery | `CHAR_LENGTH` | 8 | |
| clickhouse | `lengthUTF8` | 8 | |
| datafusion | `LENGTH` | 8 | |
| druid | `LENGTH` | 10 | probably related to UTF-16 |
| duckdb | `LENGTH` | 8 | |
| exasol | `LENGTH` | 8 | |
| flink | `CHAR_LENGTH` | 8 | |
| impala | `CHAR_LENGTH` | 26 | this is the byte length |
| mssql | `LEN` | 6 | no idea |
| mysql | `CHAR_LENGTH` | 14 | byte length of non-zero-width chars? |
| oracle | `LENGTHC` | 6 | no idea |
| polars | `col.str.len_chars()` | 8 | |
| postgres | `CHAR_LENGTH` | 8 | |
| pyspark | `length` | 8 | |
| risingwave | `CHAR_LENGTH` | 8 | |
| snowflake | `LENGTH` | 8 | |
| sqlite | `LENGTH` | 8 | |
| trino | `LENGTH` | 8 | |
The upshot here is that we can't make any strong guarantees about what is returned across backends for string operations that depend on a notion of "character", even when encoding is held constant.
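Worth noting: even with a fixed definition of "character = code point", Unicode normalization shifts the answer. A quick illustration with Python's `unicodedata`:

```python
import unicodedata

s = "É"
print(len(unicodedata.normalize("NFC", s)))  # 1: precomposed U+00C9
print(len(unicodedata.normalize("NFD", s)))  # 2: 'E' + combining acute accent
```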
This is why we can't have nice strings.
:chefs_kiss:
Yeah, no strong guarantees, but we can get the existing ones closer to each other, and then have `len_bytes` available if someone is specifically looking for that.
Yep, sounds good to me. Slight preference for `byte_length` over `len_bytes` or `length_bytes`, OR deprecating `length` in favor of `len` and adding `len_bytes` or `byte_len`.

Actually, I hate `byte_len`, so scratch that one from my suggestion list.