feat(strings): `len_bytes` and `length` for string columns
As turned up in #10020, we get varying answers back from backends regarding the length of a string once that string isn't just simple ASCII.
This is due to the length of a string being sort of a fuzzy concept in the age of Unicode.
I think we should try to standardize on:
`length` returns the number of characters in the string, so e.g. `len('🐍') == 1` and `len('É') == 1`

This may not be possible on all backends, but I think we can get close, then add a new method

`len_bytes` (I'm open to naming suggestions) that returns the length in bytes of the string: `len_bytes('🐍') == 4` and `len_bytes('É') == 2`
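For reference, here's a minimal sketch of the proposed semantics in plain Python, using `str` for character (code point) length and UTF-8 `bytes` for byte length. Note that `len_bytes` doesn't exist yet; this just shows the reference behavior, assuming UTF-8 as the byte encoding:

```python
# Character count vs. byte count for the examples above (assumes UTF-8):
for s in ("🐍", "É"):
    print(s, len(s), len(s.encode("utf-8")))
# 🐍 1 4
# É 1 2
```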
An incomplete list of functions we can use on various backends for character length:

- Oracle: `LENGTHC`
- MySQL: `CHAR_LENGTH`
- ClickHouse: `lengthUTF8`
- Impala: `CHAR_LENGTH` (maybe...)
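For what it's worth, each backend's behavior can be probed directly from ibis. A hypothetical probe session against the DuckDB backend (any `ibis.<backend>.connect()` works the same way):

```python
import ibis

# Probe a backend's notion of string "length" by executing a literal expression
con = ibis.duckdb.connect()  # in-memory DuckDB
expr = ibis.literal("🤦‍♂️🤦‍♀️").length()
print(con.execute(expr))  # DuckDB's LENGTH counts code points -> 8
```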
Here's a table of how our backends handle computing the length of the string 🤦‍♂️🤦‍♀️. It will surprise no one that not every backend we support agrees on this string's length.

TL;DR: consensus seems to be to count the number of code points.
The code points are:
>>> s = '🤦‍♂️🤦‍♀️'
>>> code_points = [c.encode("unicode_escape") for c in s]
>>> print('\n'.join(map(bytes.decode, code_points)))
\U0001f926 # person facepalming
\u200d # zero-width joining character
\u2642 # male symbol
\ufe0f # variation selector 16
\U0001f926 # person facepalming
\u200d # zero-width joining character
\u2640 # female symbol
\ufe0f # variation selector 16
>>> len(code_points)
8
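A quick plain-Python sketch that reproduces three of the counts in the table below: code points, UTF-8 bytes, and UTF-16 code units. The Impala and Druid correspondences are the ones already noted in the table:

```python
s = "🤦‍♂️🤦‍♀️"  # ZWJ sequences intact
print(len(s))                            # 8  -> code points (the consensus)
print(len(s.encode("utf-8")))            # 26 -> UTF-8 bytes (matches Impala)
print(len(s.encode("utf-16-le")) // 2)   # 10 -> UTF-16 code units (matches Druid)
```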
| backend | function | len('🤦‍♂️🤦‍♀️') | notes |
|---|---|---|---|
| bigquery | `CHAR_LENGTH` | 8 | |
| clickhouse | `lengthUTF8` | 8 | |
| datafusion | `LENGTH` | 8 | |
| druid | `LENGTH` | 10 | probably related to UTF-16 |
| duckdb | `LENGTH` | 8 | |
| exasol | `LENGTH` | 8 | |
| flink | `CHAR_LENGTH` | 8 | |
| impala | `CHAR_LENGTH` | 26 | this is the byte length |
| mssql | `LEN` | 6 | no idea |
| mysql | `CHAR_LENGTH` | 14 | byte length of non-zero-width chars? |
| oracle | `LENGTHC` | 6 | no idea |
| polars | `col.str.len_chars()` | 8 | |
| postgres | `CHAR_LENGTH` | 8 | |
| pyspark | `length` | 8 | |
| risingwave | `CHAR_LENGTH` | 8 | |
| snowflake | `LENGTH` | 8 | |
| sqlite | `LENGTH` | 8 | |
| trino | `LENGTH` | 8 | |
The upshot here is that we can't make any strong guarantees about what is returned across backends for string operations that depend on a notion of "character", even when encoding is held constant.
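Worth noting: even with a fixed definition of "character = code point", Unicode normalization shifts the answer. A quick illustration with Python's `unicodedata`:

```python
import unicodedata

s = "É"
print(len(unicodedata.normalize("NFC", s)))  # 1: precomposed U+00C9
print(len(unicodedata.normalize("NFD", s)))  # 2: 'E' + combining acute accent
```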
This is why we can't have nice strings.
:chefs_kiss:
Yeah, no strong guarantees, but we can get the existing ones closer to each other, and then have `len_bytes` available if someone is specifically looking for that.
Yep, sounds good to me. Slight preference for `byte_length` over `len_bytes` or `length_bytes`, OR deprecating `length` in favor of `len` and adding `len_bytes` or `byte_len`.

Actually, I hate `byte_len`, so scratch that one from my suggestion list.