
Lazy schema dtype error

Open s-banach opened this issue 2 years ago • 9 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Similar to the bug I reported previously, #6643.

Reproducible example

import polars as pl

# sqrt
df = pl.DataFrame({"x": pl.Series(values=[], dtype=pl.Float32)})
correct_result = df.select(pl.col("x").sqrt()).select(pl.col(pl.Float32))
lazy_result = df.lazy().select(pl.col("x").sqrt()).select(pl.col(pl.Float32)).collect()
print(correct_result.shape == lazy_result.shape)  # prints False, expected True

# diff
df = pl.DataFrame({"x": pl.Series(values=[], dtype=pl.UInt8)})
correct_result = df.select(pl.col("x").diff()).select(pl.col(pl.Int16))
lazy_result = df.lazy().select(pl.col("x").diff()).select(pl.col(pl.Int16)).collect()
print(correct_result.shape == lazy_result.shape)  # prints False, expected True

Expected behavior

Lazy and eager give the same result.
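
One plausible reading of why the shapes differ, sketched as a toy model in plain Python rather than Polars code: the eager path matches pl.col(dtype) against the dtypes the data actually has, while the lazy path matches it against the dtypes the planner predicts. The names ToyExpr, eager_select and lazy_select are hypothetical, and the "buggy" planner rule (output dtype equals input dtype) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Schema = Dict[str, str]  # column name -> dtype name (toy representation)


@dataclass
class ToyExpr:
    # What the query planner believes the output dtype will be.
    planned_dtype: Callable[[str], str]
    # What executing the expression actually produces.
    runtime_dtype: Callable[[str], str]


def eager_select(schema: Schema, expr: ToyExpr, wanted: str) -> List[str]:
    # Eager: run the expression first, then match on the real dtypes.
    actual = {name: expr.runtime_dtype(dt) for name, dt in schema.items()}
    return [name for name, dt in actual.items() if dt == wanted]


def lazy_select(schema: Schema, expr: ToyExpr, wanted: str) -> List[str]:
    # Lazy: the dtype selector is resolved against the planned schema,
    # before any data has been touched.
    planned = {name: expr.planned_dtype(dt) for name, dt in schema.items()}
    return [name for name, dt in planned.items() if dt == wanted]


def diff_runtime(dt: str) -> str:
    # Mirrors the report above: diff() on UInt8 yields Int16 at runtime.
    return "Int16" if dt == "UInt8" else dt


# Assumed bug: the planner claims diff() keeps the input dtype.
buggy_diff = ToyExpr(planned_dtype=lambda dt: dt, runtime_dtype=diff_runtime)
# Fixed: the planner reports the same dtype the runtime produces.
fixed_diff = ToyExpr(planned_dtype=diff_runtime, runtime_dtype=diff_runtime)

schema = {"x": "UInt8"}
print(eager_select(schema, buggy_diff, "Int16"))  # ['x']
print(lazy_select(schema, buggy_diff, "Int16"))   # []
print(lazy_select(schema, fixed_diff, "Int16"))   # ['x']
```

In this model the empty lazy selection corresponds to the frame with no columns seen in the reproducible example.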

Installed versions

---Version info---
Polars: 0.16.4
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.5.2
numpy: 1.22.3
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: <not installed>

s-banach • Feb 12 '23 03:02

Here are some more from the Series.computation list of exprs:

import polars as pl
from polars.datatypes import NUMERIC_DTYPES

x = pl.col("x")
funcs = [
    x.dot(x),
    x.entropy(),
    x.rolling_mean(1),
    x.rolling_quantile(1),
    x.rolling_skew(1),  # Simplifies to "rolling_apply_float()"
    x.rolling_std(1),
    x.rolling_var(1),
]
for func in funcs:
    bad_dtypes = []
    for dtype in NUMERIC_DTYPES:
        df = pl.DataFrame({"x": pl.Series(values=[1, 2, 3], dtype=dtype)})
        result_eager = df.select(func)
        dtype_eager = result_eager.get_column("x").dtype
        result_lazy = df.lazy().select(func).select(pl.col(dtype_eager)).collect()
        if not result_lazy.frame_equal(result_eager):
            bad_dtypes.append(dtype)
    print(func, bad_dtypes)

Output:

col("x").dot([col("x")]) [UInt16, UInt8, Int16, Int8]
col("x").rolling_mean() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
col("x").rolling_quantile() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
col("x").rolling_apply_float() [Float64, UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8, Float32]
col("x").rolling_std() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
col("x").rolling_var() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]

s-banach • Feb 12 '23 15:02

A few more from Series.aggregation. Note that product() panics for certain dtypes.

import polars as pl
from polars.datatypes import NUMERIC_DTYPES

x = pl.col("x")
funcs = [
    x.arg_max(),
    x.arg_min(),
    x.max(),
    x.mean(),
    x.median(),
    x.min(),
    x.mode(),
    x.nan_max(),
    x.nan_min(),
    # x.product(),  # Panic with Int32, UInt32, or UInt64
    x.quantile(0.5),
    x.std(),
    x.sum(),
    x.var()
]
for func in funcs:
    bad_dtypes = []
    for dtype in NUMERIC_DTYPES:
        df = pl.DataFrame({
            "x": pl.Series(values=[1, 2, 3] * 2, dtype=dtype),
            "y": pl.Series(values=["a"] * 3 + ["b"] * 3)
        })
        try:
            result_eager = df.select(func.over("y")).select("x")
            dtype_eager = result_eager["x"].dtype
            result_lazy = df.lazy().select(func.over("y")).select(pl.col(dtype_eager)).collect()
            if not result_eager.frame_equal(result_lazy):
                bad_dtypes.append(dtype)
        except Exception:  # record the failure instead of aborting the sweep
            bad_dtypes.append(f"({dtype} -> Exception)")
    if len(bad_dtypes) > 0:
        print(func, bad_dtypes)

Output:

col("x").median() [Float32]
col("x").mode() [UInt32, Int16, '(Float64 -> Exception)', UInt64, Int8, '(Float32 -> Exception)', Int64, Int32]

s-banach • Feb 12 '23 15:02

Same test for Series.arr

import polars as pl
from polars.datatypes import NUMERIC_DTYPES

x = pl.col("x")
funcs = [
    x.arr.arg_max(),
    x.arr.arg_min(),
    x.arr.concat(x),
    # x.arr.contains(3),
    x.arr.diff(1),
    # x.arr.eval(),
    x.arr.explode(),
    x.arr.first(),
    x.arr.get(0),
    x.arr.head(2),
    # x.arr.join("_"),
    x.arr.last(),
    x.arr.lengths(),
    x.arr.max(),
    x.arr.mean(),
    x.arr.min(),
    x.arr.reverse(),
    x.arr.shift(1),
    x.arr.slice(0),
    x.arr.sort(),
    x.arr.sum(),
    x.arr.tail(2),
    x.arr.take([0]),
    x.arr.to_struct(),
    x.arr.unique(),
]

for func in funcs:
    bad_dtypes = []
    for dtype in NUMERIC_DTYPES:
        df = pl.DataFrame({"x": pl.Series(values=[[1, 2, 3]], dtype=pl.List(dtype))})
        result_eager = df.select(func)
        dtype_eager = result_eager["x"].dtype
        result_lazy = df.lazy().select(func).select(pl.col(dtype_eager)).collect()
        if not result_eager.frame_equal(result_lazy):
            bad_dtypes.append(dtype)
    if len(bad_dtypes) > 0:
        print(func, bad_dtypes)

Output:

col("x").arr.diff() [UInt16, UInt64, UInt32, UInt8]
col("x").arr.sum() [UInt16, Int8, Int16, UInt8]
col("x").arr.to_struct() [UInt16, Int32, Float32, Int8, Int64, Int16, UInt64, Float64, UInt32, UInt8]

s-banach • Feb 12 '23 18:02

Similar problem with Series.bin.encode

import polars as pl

df = pl.DataFrame({"x": [b"a", b"b", b"c"]})
expr = pl.col("x").bin.encode("hex")
result = df.select(expr)
dtype = result["x"].dtype  # utf8
result_lazy = df.lazy().select(expr).select(pl.col(dtype)).collect()
assert result.shape != result_lazy.shape

s-banach • Feb 12 '23 22:02

This diff is a good example of what needs to be done per function. Move towards the proper FunctionExpr and ensure the schema reports the correct dtypes.
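
A minimal sketch of how one might check whether "the schema reports the correct dtypes" for a given query: compare the schema the planner predicts with the schema of the collected result. The helper name schema_mismatches is hypothetical. It only relies on the frame exposing collect() plus either collect_schema() (newer Polars releases) or a schema property (older releases such as the 0.16 line in this thread), and that version split is an assumption worth verifying against your installed version.

```python
from typing import Any, Dict, Tuple


def schema_mismatches(lazy_frame: Any) -> Dict[str, Tuple[Any, Any]]:
    """Return {column: (planned_dtype, actual_dtype)} where the two differ."""
    # Planned schema: what the query planner predicts without running anything.
    if hasattr(lazy_frame, "collect_schema"):
        planned = dict(lazy_frame.collect_schema())
    else:
        planned = dict(lazy_frame.schema)

    # Actual schema: what the executed query really produced.
    actual = dict(lazy_frame.collect().schema)

    return {
        name: (planned.get(name), dtype)
        for name, dtype in actual.items()
        if planned.get(name) != dtype
    }


# Hypothetical usage with Polars (assuming `import polars as pl`):
#   df = pl.DataFrame({"x": pl.Series(values=[], dtype=pl.Float32)})
#   print(schema_mismatches(df.lazy().select(pl.col("x").sqrt())))
# An empty dict means the planned and collected dtypes agree.
```

Running this once per expression and input dtype gives a direct view of the mismatch, without going through a dtype selector such as pl.col(dtype_eager).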

ritchie46 • Feb 13 '23 13:02

I will give it a try.

papparapa • Feb 14 '23 12:02

> I will give it a try.

Cool! Could you try to do only a few functions at a time? This keeps the PRs small.

ritchie46 • Feb 14 '23 12:02

I have added a checklist to keep track of the progress (please correct me if I'm missing something):

  • [ ] col("x").dot([col("x")]) [UInt16, UInt8, Int16, Int8]
  • [ ] col("x").rolling_mean() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
  • [ ] col("x").rolling_quantile() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
  • [ ] col("x").rolling_apply_float() [Float64, UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8, Float32]
  • [ ] col("x").rolling_std() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
  • [ ] col("x").rolling_var() [UInt64, UInt16, UInt32, Int64, UInt8, Int32, Int16, Int8]
  • [ ] col("x").median() [Float32]
  • [ ] col("x").mode() [UInt32, Int16, '(Float64 -> Exception)', UInt64, Int8, '(Float32 -> Exception)', Int64, Int32]
  • [ ] col("x").product() ['(Int32 -> Exception)', '(UInt32 -> Exception)', '(UInt64 -> Exception)']
  • [ ] col("x").arr.diff() [UInt16, UInt64, UInt32, UInt8]
  • [ ] col("x").arr.sum() [UInt16, Int8, Int16, UInt8]
  • [ ] col("x").arr.to_struct() [UInt16, Int32, Float32, Int8, Int64, Int16, UInt64, Float64, UInt32, UInt8]
  • [ ] col("x").bin.encode("hex") [utf8]

mhattingpete • Feb 19 '23 07:02

@ritchie46 is this still relevant?

I think this should be small enough for me to take on as a first issue. However, the codebase has changed quite a bit since the example you provided, so I'm a bit in the dark. If you could give me some pointers on where to start and what the desired types are, that would be great.

romanovacca • Sep 12 '23 09:09

The reproducible example in the original issue has been fixed. There have been other mentions in this issue, but since it is quite outdated, I think it's better to close this as it's not clear what is still relevant. If there are still unresolved problems here, please open a new issue!

stinodego • Jan 18 '24 21:01