Extend `Expr.reinterpret` to arbitrary primitive numeric types

Open • knl opened this issue 1 year ago • 6 comments

Description

ClickHouse has a rich set of reinterpretAs* type-casting functions that are helpful when dealing with dirty data: https://clickhouse.com/docs/en/sql-reference/functions/type-conversion-functions.

These are really useful when data is written in one format but was actually intended to be used in another. Case in point: I have a parquet file with a column of UInt64 values. Unfortunately, this data is actually Float64. The writer (which I can't change) just took the 8 bytes of each float value and interpreted them as an int.

Here is an example of what's happening:

import polars as pl

data = {"nums": [4837362400224322580,4837362400224322582,4837362400224322584]}
schema = {"nums": pl.UInt64}

xdf = pl.DataFrame(data, schema)
print(xdf)

which gets me:

shape: (3, 1)
┌─────────────────────┐
│ nums                │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 4837362400224322580 │
│ 4837362400224322582 │
│ 4837362400224322584 │
└─────────────────────┘

To do the proper conversion, I have to use this (rather inefficient):

import struct
import polars as pl

def long_bits_to_double(x: int) -> float:
    """Reinterpret the bits of an unsigned 64-bit integer as a 64-bit float."""
    # Pack and unpack with the same explicit (little-endian) byte order.
    bts = struct.pack('<Q', x)
    d, = struct.unpack('<d', bts)
    return d


data = {"nums": [4837362400224322580,4837362400224322582,4837362400224322584]}
schema = {"nums": pl.UInt64}

xdf = (
    pl.DataFrame(data, schema)
    .with_columns(
        naive_wrong=pl.col('nums').cast(pl.Float64),
        slow_right=pl.col('nums').map_elements(long_bits_to_double),
    )
)
with pl.Config(fmt_float="full"):
    print(xdf)

which yields:

shape: (3, 3)
┌─────────────────────┬─────────────────────┬──────────────────┐
│ nums                ┆ naive_wrong         ┆ slow_right       │
│ ---                 ┆ ---                 ┆ ---              │
│ u64                 ┆ f64                 ┆ f64              │
╞═════════════════════╪═════════════════════╪══════════════════╡
│ 4837362400224322580 ┆ 4837362400224323000 ┆ 2500000027890186 │
│ 4837362400224322582 ┆ 4837362400224323000 ┆ 2500000027890187 │
│ 4837362400224322584 ┆ 4837362400224323000 ┆ 2500000027890188 │
└─────────────────────┴─────────────────────┴──────────────────┘
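For comparison, the same bit-level reinterpretation can be done in bulk rather than one element at a time. The following is only a minimal sketch using the standard library's `array` and `memoryview`; the helper name `reinterpret_u64_as_f64` is hypothetical and not part of polars. NumPy's `ndarray.view(np.float64)` offers the same kind of buffer reinterpretation.

```python
from array import array


def reinterpret_u64_as_f64(values):
    """Reinterpret each unsigned 64-bit integer's bits as an IEEE 754 double.

    Hypothetical helper for illustration; not a polars API.
    """
    raw = array("Q", values)  # packed buffer of native-endian u64 values
    # memoryview.cast only converts to or from a byte view, so go via "B".
    return memoryview(raw).cast("B").cast("d").tolist()


nums = [4837362400224322580, 4837362400224322582, 4837362400224322584]
floats = reinterpret_u64_as_f64(nums)
print(floats)
```

The buffer is never converted numerically; the same 8 bytes per element are simply read back as doubles, which is the behaviour the `slow_right` column reproduces.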

knl • Jan 12 '24 08:01

We already have Series.reinterpret but it's limited to u64/i64. Perhaps it can be extended?

mcrumiller • Jan 12 '24 13:01

I would be in favour of extending Series.reinterpret/Expr.reinterpret to support an arbitrary dtype argument. The only requirement being that the size of the two types is the same (e.g. no reinterpreting u16s as i32s) and that the types are primitive numeric types.
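The two requirements stated here (same byte width, both primitive numeric) can be sketched as a small check. This is an illustration only: the `PRIMITIVE_SIZES` table and the `can_reinterpret` helper are hypothetical names, and dtypes are represented as plain strings rather than polars types.

```python
# Byte widths of the primitive numeric dtypes, keyed by dtype name.
PRIMITIVE_SIZES = {
    "Int8": 1, "UInt8": 1,
    "Int16": 2, "UInt16": 2,
    "Int32": 4, "UInt32": 4, "Float32": 4,
    "Int64": 8, "UInt64": 8, "Float64": 8,
}


def can_reinterpret(src: str, dst: str) -> bool:
    """Return True when both dtypes are primitive numeric and equally wide."""
    if src not in PRIMITIVE_SIZES or dst not in PRIMITIVE_SIZES:
        return False  # non-primitive or non-numeric dtype
    return PRIMITIVE_SIZES[src] == PRIMITIVE_SIZES[dst]


print(can_reinterpret("UInt64", "Float64"))  # same width: allowed
print(can_reinterpret("UInt16", "Int32"))    # different width: rejected
```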

orlp • Jan 12 '24 14:01

I took the liberty of editing your title.

orlp • Jan 12 '24 14:01

@orlp when you say "the size of the two types is the same" do you mean "the size of the two types is compatible," i.e. can we interpret i64 as i32 with a series double the length? I realize that may be incompatible with the validity bitmap, but a quick and obvious fix in that case would be to duplicate each bit in the bitmap, i.e. 0100 => 00110000.
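To make the proposal concrete, here is a rough sketch of what a width-changing reinterpretation would involve, using plain Python lists in place of polars buffers. The helpers `split_i64_to_i32` and `expand_validity` are hypothetical, and the little-endian layout is an assumption made for the example.

```python
import struct


def expand_validity(bits: str, factor: int) -> str:
    """Repeat every validity bit `factor` times, e.g. 0100 -> 00110000."""
    return "".join(bit * factor for bit in bits)


def split_i64_to_i32(values):
    """Reinterpret each i64 (or None) as two little-endian i32 halves."""
    halves, validity = [], []
    for v in values:
        valid = v is not None
        # Null slots still occupy 8 bytes; use 0 as a placeholder payload.
        pair = struct.unpack("<2i", struct.pack("<q", v if valid else 0))
        halves.extend(pair)
        validity.extend([valid, valid])  # one validity flag per new element
    return halves, validity


halves, validity = split_i64_to_i32([1, None, -1])
print(halves, validity)
print(expand_validity("0100", 2))
```

Each source element yields two output elements, so a null in the source becomes two nulls in the result, which is the bit-duplication step described above.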

I rather like being able to view bytes via u8 views in numpy. We can currently cast to pl.Binary, which is similar but not always quite the same, as it tends to strip leading 0 bytes:

>>> pl.Series([1], dtype=pl.UInt64).cast(pl.Binary)[0]
b'1'
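For reference, a raw u8 view of the same value would look quite different from the `b'1'` above. The sketch below uses only `struct` and assumes a little-endian layout. Note that `b'1'` is the single byte 0x31, the ASCII digit "1", which is what a textual rendering of the number would produce rather than a truncated byte view.

```python
import struct

raw = struct.pack("<Q", 1)   # the 8 bytes of a little-endian UInt64 holding 1
as_u8 = list(raw)            # the same bytes viewed as unsigned 8-bit integers
text = str(1).encode()       # the decimal text of the number, as bytes

print(raw)    # eight bytes, only the first one non-zero
print(as_u8)
print(text)
```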

mcrumiller • Jan 12 '24 14:01

> @orlp when you say "the size of the two types is the same" do you mean "the size of the two types is compatible," i.e. can we interpret i64 as i32 with a series double the length? I realize that may be incompatible with the validity bitmap, but a quick and obvious fix in that case would be to duplicate each bit in the bitmap, i.e. 0100 => 00110000.

I would suggest we avoid such an approach and, with reinterpret, focus only on 1:1 mappings. ClickHouse has a C++ background, and its reinterpretAs* functions do the casting at the bit level.

knl • Jan 12 '24 15:01

^ I have something that simply appends logic for the Int and Float 32/64 types. It looks like the current impl of reinterpret was added before polars supported reinterpreting series beyond the 32-bit and 64-bit primitives, so I'm not sure whether a larger revamp of reinterpret is desired, since ostensibly it could support more data types now. I'm also not sure what the desired API is; I just added an optional param int: bool = True for determining whether you want to cast to int or float.

@orlp do you have an opinion on the above impl?

collinprince • Jan 13 '24 07:01