polars
polars copied to clipboard
Date methods `month/week/weekday/...` should return `UInt8` type
There are multiple methods in the datetime space that return values that are constrained to some small range (e.g. weekday returns a value between 1 and 7). These currently return as type UInt32. This seems wasteful; a UInt8 would be sufficient.
Is there a reason for this?
This is due to chrono's u32 return type, but I agree that this is wasteful. I am up for changing this. :+1:
Adding a reference to #4990.
Given the troubles of unsized integers named in #4990 let's go for a Int8 as that also fits easily.
I am up for changing this.
Nice; I have our data API already typing these as downcasts automagically - not having to do so will be one more operation we can leave in polars' capable hands; little changes add up :)
Looking into this now. I'll start by adjusting date methods at https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/date.rs#L31 and https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/datetime.rs#L56 , and moving from there
I'm trying to adjust the cast_and_apply function at https://github.com/pola-rs/polars/blob/master/polars/polars-time/src/chunkedarray/datetime.rs#L10-L16 to cast the resulting type into a desired type (e.g. Int8Type).
Despite adjusting the function to take in a desired_type, I can't get the resulting type to be anything but a u32. Is there a casting related to ChunkedArray::from_chunks(ca.name(), chunks) that I'm missing?
Changes:
- Added
desired_type: &DataType, // New desired type - Added
.to(desired_type.to_arrow()))to resulting array
fn month(&self) -> Int8Chunked {
cast_and_apply(self.as_datetime(), temporal::month, &DataType::Int8)
// type mismatch resolving `<polars_core::datatypes::Int8Type as polars_core::datatypes::PolarsNumericType>::Native == u32`
// expected `i8`, found `u32`
}
fn cast_and_apply<
F: Fn(&dyn Array) -> ArrowResult<PrimitiveArray<T::Native>>,
T: PolarsNumericType,
>(
ca: &DatetimeChunked,
func: F,
desired_type: &DataType, // New desired type
) -> ChunkedArray<T> {
let dtype = ca.dtype().to_arrow();
let chunks = ca
.downcast_iter()
.map(|arr| {
let arr = cast(
arr,
&dtype,
CastOptions {
wrapped: true,
partial: false,
},
)
.unwrap();
Box::from(func(&*arr).unwrap().to(desired_type.to_arrow())) as ArrayRef // Casting away from chronos u32 default
})
.collect();
unsafe { ChunkedArray::from_chunks(ca.name(), chunks) }
}
I don't think the function needs to be adapted. Only the closure/function that's passed to it should be different.
That makes sense! Revisiting, I'm adjusting each function passed to be of form:
fn month_i8(array: &dyn Array) -> ArrowResult<PrimitiveArray<i8>> {
let m32 = temporal::month(array).unwrap();
Ok(PrimitiveArray::<i8>::from_vec(m32.into_iter().map(|v| v.unwrap() as i8).collect::<Vec<_>>()))
}
I'm looking for a cleaner way to cast our PrimitiveArray<u32> to PrimitiveArray<i8>, but I'm liking that approach a lot better
Yes, that's alrwya better path
I only think we should adapt the temporal::month kernel. As we now have two vec allocations. One in the kernel to Vec<u32> and then one later to convert to Vec<i8>.
Edit: Resolved!
Agreed that we should avoid two vec allocation. Unfortunately my Rust knowledge is preventing me from finding a cleaner conversion from the resulting temporal::month(array).unwrap() as a PrimitiveArray::<u32> into PrimitiveArray<i8> without casting, so I'm looking into that next
@ritchie46 How should the pl.Date type be approached with this change? Specifically, the test at https://github.com/pola-rs/polars/blob/master/py-polars/tests/unit/test_series.py#L1919 will now have dt.month being of type Int8, but pl.Date evaluates by default to UInt32.
Given that pl.Date could refer to both dt.month (Int8) and dt.ordinal_day (Int16), how should I re-work pl.Date to fit the various possible types?
I've given an update on the draft PR #6272 , but copying here in case it's actionable to others
Life circumstances have changed a bit and I do not have as much time to work on this as before. I apologize, and will try to get to this when I'm able to. Also happy to let others work on it should there be interest