Extend `Expr.reinterpret` to arbitrary primitive numeric types
Description
ClickHouse has a rich set of reinterpretAs* type casting functions that are helpful when dealing with dirty data: https://clickhouse.com/docs/en/sql-reference/functions/type-conversion-functions
These are really useful when data was written in one format but was actually intended to be read in another. Case in point: I have a Parquet file with a column of UInt64 values. Unfortunately, this data is actually Float64. The writer (which I can't change) just took the 8 bytes of each float value and interpreted them as an int.
Here is an example of what's happening:
import polars as pl
data = {"nums": [4837362400224322580,4837362400224322582,4837362400224322584]}
schema = {"nums": pl.UInt64}
xdf = pl.DataFrame(data, schema)
print(xdf)
which gets me:
shape: (3, 1)
┌─────────────────────┐
│ nums │
│ --- │
│ u64 │
╞═════════════════════╡
│ 4837362400224322580 │
│ 4837362400224322582 │
│ 4837362400224322584 │
└─────────────────────┘
To do the proper conversion, I have to use this (rather inefficient):
import struct
import polars as pl

def long_bits_to_double(x: int) -> float:
    """Reinterpret the 8 bytes of an unsigned 64-bit integer as a float64."""
    bts = struct.pack('<Q', x)
    # Unpack with the same explicit byte order that was used for packing.
    d, = struct.unpack('<d', bts)
    return d

data = {"nums": [4837362400224322580,4837362400224322582,4837362400224322584]}
schema = {"nums": pl.UInt64}
xdf = (
    pl.DataFrame(data, schema)
    .with_columns(
        naive_wrong=pl.col('nums').cast(pl.Float64),
        slow_right=pl.col('nums').map_elements(long_bits_to_double, return_dtype=pl.Float64),
    )
)
with pl.Config(fmt_float="full"):
    print(xdf)
which yields:
shape: (3, 3)
┌─────────────────────┬─────────────────────┬──────────────────┐
│ nums ┆ naive_wrong ┆ slow_right │
│ --- ┆ --- ┆ --- │
│ u64 ┆ f64 ┆ f64 │
╞═════════════════════╪═════════════════════╪══════════════════╡
│ 4837362400224322580 ┆ 4837362400224323000 ┆ 2500000027890186 │
│ 4837362400224322582 ┆ 4837362400224323000 ┆ 2500000027890187 │
│ 4837362400224322584 ┆ 4837362400224323000 ┆ 2500000027890188 │
└─────────────────────┴─────────────────────┴──────────────────┘
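To make the difference between the two columns concrete, here is a minimal standard-library sketch of what a bulk reinterpret does: it reads the same buffer with a different element type, without touching the bytes. It is not how polars implements anything; the `array`/`memoryview` route is just an assumption chosen to keep the example free of dependencies.

```python
from array import array

# Same three values as in the example above, stored as unsigned 64-bit ints.
raw = array("Q", [4837362400224322580, 4837362400224322582, 4837362400224322584])
assert raw.itemsize == 8  # the sketch assumes "Q" is exactly 8 bytes wide

# memoryview.cast only converts to or from a byte format, so go via "B".
as_f64 = memoryview(raw).cast("B").cast("d")

reinterpreted = as_f64.tolist()       # same bits, read as float64
value_cast = [float(x) for x in raw]  # numeric conversion, like .cast(pl.Float64)

print(reinterpreted)
print(value_cast)
```

The first list matches the slow_right column and the second matches naive_wrong.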
We already have Series.reinterpret but it's limited to u64/i64. Perhaps it can be extended?
I would be in favour of extending Series.reinterpret/Expr.reinterpret to support an arbitrary dtype argument. The only requirement being that the size of the two types is the same (e.g. no reinterpreting u16s as i32s) and that the types are primitive numeric types.
I took the liberty of editing your title.
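As a rough illustration of the rule proposed above (same size, primitive numeric types only), the check could look something like the sketch below. The `BYTE_WIDTH` table and the `can_reinterpret` helper are hypothetical names made up for this example and are not part of polars.

```python
# Hypothetical table of primitive numeric types and their widths in bytes.
BYTE_WIDTH = {
    "i8": 1, "u8": 1,
    "i16": 2, "u16": 2,
    "i32": 4, "u32": 4, "f32": 4,
    "i64": 8, "u64": 8, "f64": 8,
}

def can_reinterpret(src: str, dst: str) -> bool:
    """Return True if both types are primitive numerics of the same width."""
    if src not in BYTE_WIDTH or dst not in BYTE_WIDTH:
        return False  # not a primitive numeric type
    return BYTE_WIDTH[src] == BYTE_WIDTH[dst]

print(can_reinterpret("u64", "f64"))  # same width
print(can_reinterpret("u16", "i32"))  # different widths
```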
@orlp when you say "the size of the two types is the same", do you mean "the size of the two types is compatible", i.e. can we reinterpret i64 as i32 with a series double the length? I realize that may be incompatible with the validity bitmap, but a quick and obvious fix in that case would be to duplicate each bit in the bitmap, i.e. 0100 => 00110000.
I rather like being able to view bytes via u8 views in numpy. We can currently cast to pl.Binary, which is similar but not always quite the same, as it tends to strip leading 0 bytes:
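The bitmap fix described above can be sketched in a few lines. This is only an illustration on plain Python lists; `expand_validity` is a hypothetical helper, and a real implementation would operate on packed bitmaps.

```python
def expand_validity(bits: list[bool], factor: int) -> list[bool]:
    """Repeat each validity bit `factor` times, keeping the original order."""
    return [bit for bit in bits for _ in range(factor)]

def from_string(s: str) -> list[bool]:
    """Parse a string like '0100' into a list of booleans."""
    return [c == "1" for c in s]

def to_string(bits: list[bool]) -> str:
    """Render a list of booleans back into a string of 0s and 1s."""
    return "".join("1" if b else "0" for b in bits)

# Reinterpreting i64 as i32 doubles the length, so each bit is duplicated.
print(to_string(expand_validity(from_string("0100"), 2)))
```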
>>> pl.Series([1], dtype=pl.UInt64).cast(pl.Binary)[0]
b'1'
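For comparison, a fixed-width u8 view of the same value would keep all eight bytes. A minimal sketch using only the standard library, assuming little-endian byte order:

```python
import struct

value = 1

# Pack as an unsigned 64-bit integer; "<" selects little-endian (an assumption).
fixed_width = struct.pack("<Q", value)

# Iterating over bytes yields ints in 0..255, i.e. a u8 view.
u8_view = list(fixed_width)

print(fixed_width)
print(u8_view)
```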
> @orlp when you say "the size of the two types is the same" do you mean "the size of the two types is compatible," i.e. can we interpret i64 as i32 with a series double the length? I realize that may be incompatible with the validity bitmap, but a quick and obvious fix would be in that case to duplicate each bit in the bitmap, i.e. 0100 => 00110000.
I would suggest we avoid such an approach and keep reinterpret focused on 1:1 mappings only. ClickHouse has a C++ background, and its reinterpretAs* functions do casting at the bit level.
^ I have something that simply adds logic for the Int and Float 32/64 types. It looks like the current impl of reinterpret was added before polars supported reinterpreting series beyond the 32-bit and 64-bit primitives, so I'm not sure whether a larger revamp of reinterpret is desired, since ostensibly it could support more data types now. I'm also not sure what the desired API is; I just added an optional param int: bool = True for determining whether you want to cast to int or float.
@orlp do you have an opinion on the above impl?