polars icon indicating copy to clipboard operation
polars copied to clipboard

Date methods `month/week/weekday/...` should return `UInt8` type

Open stinodego opened this issue 2 years ago • 12 comments
trafficstars

There are multiple methods in the datetime space that return values that are constrained to some small range (e.g. weekday returns a value between 1 and 7). These currently return as type UInt32. This seems wasteful; a UInt8 would be sufficient.

Is there a reason for this?

stinodego avatar Jan 06 '23 10:01 stinodego

This is due to chrono's u32 return type, but I agree that this is wasteful. I am up for changing this. :+1:

ritchie46 avatar Jan 06 '23 10:01 ritchie46

Adding a reference to #4990.

mcrumiller avatar Jan 06 '23 14:01 mcrumiller

Given the troubles of unsized integers named in #4990 let's go for a Int8 as that also fits easily.

ritchie46 avatar Jan 06 '23 15:01 ritchie46

I am up for changing this.

Nice; I have our data API already typing these as downcasts automagically - not having to do so will be one more operation we can leave in polars' capable hands; little changes add up :)

alexander-beedie avatar Jan 07 '23 10:01 alexander-beedie

Looking into this now. I'll start by adjusting date methods at https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/date.rs#L31 and https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/datetime.rs#L56 , and moving from there

andrewpollack avatar Jan 16 '23 15:01 andrewpollack

I'm trying to adjust the cast_and_apply function at https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/datetime.rs#L10-L16 to cast the resulting type into a desired type (e.g. Int8Type).

Despite adjusting the function to take in a desired_type, I can't get the resulting type to be anything but a u32. Is there a casting related to ChunkedArray::from_chunks(ca.name(), chunks) that I'm missing?

Changes:

  • Added desired_type: &DataType, // New desired type
  • Added .to(desired_type.to_arrow())) to resulting array
fn month(&self) -> Int8Chunked {
        cast_and_apply(self.as_datetime(), temporal::month, &DataType::Int8)
        // type mismatch resolving `<polars_core::datatypes::Int8Type as polars_core::datatypes::PolarsNumericType>::Native == u32`
        // expected `i8`, found `u32`
}

fn cast_and_apply<
    F: Fn(&dyn Array) -> ArrowResult<PrimitiveArray<T::Native>>,
    T: PolarsNumericType,
>(
    ca: &DatetimeChunked,
    func: F,
    desired_type: &DataType, // New desired type
) -> ChunkedArray<T> {
    let dtype = ca.dtype().to_arrow();
    let chunks = ca
        .downcast_iter()
        .map(|arr| {
            let arr = cast(
                arr,
                &dtype,
                CastOptions {
                    wrapped: true,
                    partial: false,
                },
            )
            .unwrap();
            Box::from(func(&*arr).unwrap().to(desired_type.to_arrow())) as ArrayRef // Casting away from chronos u32 default
        })
        .collect();

    unsafe { ChunkedArray::from_chunks(ca.name(), chunks) }
}

andrewpollack avatar Jan 16 '23 17:01 andrewpollack

I don't think the function needs to be adapted. Only the closure/function that's passed to it should be different.

ritchie46 avatar Jan 16 '23 17:01 ritchie46

That makes sense! Revisiting, I'm adjusting each function passed to be of form:

fn month_i8(array: &dyn Array) -> ArrowResult<PrimitiveArray<i8>> {
  let m32 = temporal::month(array).unwrap();
  Ok(PrimitiveArray::<i8>::from_vec(m32.into_iter().map(|v| v.unwrap() as i8).collect::<Vec<_>>()))
} 

I'm looking for a cleaner way to cast our PrimitiveArray<u32> to PrimitiveArray<i8>, but I'm liking that approach a lot better

andrewpollack avatar Jan 16 '23 18:01 andrewpollack

Yes, that's alrwya better path I only think we should adapt the temporal::month kernel. As we now have two vec allocations. One in the kernel to Vec<u32> and then one later to convert to Vec<i8>.

ritchie46 avatar Jan 16 '23 18:01 ritchie46

Edit: Resolved!

Agreed that we should avoid two vec allocation. Unfortunately my Rust knowledge is preventing me from finding a cleaner conversion from the resulting temporal::month(array).unwrap() as a PrimitiveArray::<u32> into PrimitiveArray<i8> without casting, so I'm looking into that next

andrewpollack avatar Jan 16 '23 19:01 andrewpollack

@ritchie46 How should the pl.Date type be approached with this change? Specifically, the test at https://github.com/pola-rs/polars/blob/master/py-polars/tests/unit/test_series.py#L1919 will now have dt.month being of type Int8, but pl.Date evaluates by default to UInt32.

Given that pl.Date could refer to both dt.month (Int8) and dt.ordinal_day (Int16), how should I re-work pl.Date to fit the various possible types?

andrewpollack avatar Jan 16 '23 20:01 andrewpollack

I've given an update on the draft PR #6272 , but copying here in case it's actionable to others

Life circumstances have changed a bit and I do not have as much time to work on this as before. I apologize, and will try to get to this when I'm able to. Also happy to let others work on it should there be interest

andrewpollack avatar Jan 24 '23 15:01 andrewpollack