Ritchie Vink
Could it be that pyarrow converts the column to categorical upon reading, whereas we first read it as a string column and then convert?
Yep, that makes more sense to me, as the local builders seem pretty optimized.
Yes, you need to hash those strings and store them in a hashmap. That's expensive. ```python >>> %%time >>> df["pickup_datetime"].cast(pl.Categorical) CPU times: user 1.91 s, sys: 162 ms, total: 2.08...
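To illustrate why the cast costs something, here is a minimal pure-Python sketch of dictionary encoding. It is not polars' implementation; `dictionary_encode` is a hypothetical helper and the sample values are made up. It only shows that every value has to be hashed and looked up in a map before it gets its integer code.

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer code (illustrative only)."""
    categories = {}  # string -> code; every lookup hashes the string
    codes = []
    for value in values:
        code = categories.get(value)
        if code is None:
            # First time we see this string: assign the next free code.
            code = len(categories)
            categories[value] = code
        codes.append(code)
    # dicts keep insertion order, so position in the list equals the code
    return codes, list(categories)


codes, categories = dictionary_encode(["cash", "card", "cash", "card", "cash"])
print(codes)       # [0, 1, 0, 1, 0]
print(categories)  # ['cash', 'card']
```

The per-row hash and lookup is the expensive part. A column with mostly unique strings, such as a datetime stored as text, gets little benefit from the encoding but still pays for every insert.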
Was closed by the wrong PR. > Global string cache is way faster now for the case above (after https://github.com/pola-rs/polars/pull/4087): Wow there is almost no overhead of the global string...
#3313 only fixed the first function. I still need to do the last one.
This would create a redundancy and would create differences in how users would write polars queries, which I want to keep to a minimum. I think I will even follow...
> I don't really think that writing queries different ways would be more of an issue than it already is, after all I can already use the dunders directly to...
Given the many requests for this, I am willing to accept a PR that implements those on the expressions.
Something like this?: ```python from pprint import pprint pprint(df.schema) ``` ``` {'dropoff_datetime': , 'dropoff_latitude': , 'dropoff_longitude': , 'fare_amount': , 'mta_tax': , 'passenger_count': , 'payment_type': , 'pickup_datetime': , 'pickup_latitude': , 'pickup_longitude':...
Right, so we should improve the print of our nested structures!
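For reference, the `pprint` workaround suggested above can be tried on a plain dict. This is only a sketch: the column names come from the thread, but the dtype names are placeholder strings, since the real values did not survive in the output quoted above.

```python
from pprint import pformat

# Hypothetical schema: dtype names are placeholders, not real polars dtypes.
schema = {
    "pickup_datetime": "Utf8",
    "fare_amount": "Float64",
    "passenger_count": "Int64",
}

# A narrow width forces one entry per line; pformat sorts dict keys by default.
formatted = pformat(schema, width=40)
print(formatted)
```

Note that `pprint` sorts the keys, so the printed order may differ from the column order in the frame. On Python 3.8 and later, passing `sort_dicts=False` keeps the original order.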