find_column_type takes too long for large datasets
Attempting to parse strings to timestamps takes very long when the dataframe has many rows and the column is a genuine string column that is not in any timestamp format.
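For context, a rough illustration of the cost. This is only a sketch of the failure mode, not the exact locopy code path; I'm assuming the cast is attempted across the whole column (e.g. with `errors="coerce"`), so every row gets scanned:

```python
import time

import pandas as pd

# One million plain strings that are definitely not timestamps.
big = pd.Series(["definitely not a date"] * 1_000_000, dtype="string")

start = time.perf_counter()
# Every row goes through date parsing before the column is given up on.
pd.to_datetime(big, errors="coerce")
elapsed = time.perf_counter() - start
print(f"spent {elapsed:.2f}s probing a column that was never a timestamp")
```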
Possible solutions:
- Could we leave it to the user to parse timestamps explicitly in their original dataframe before uploading to Snowflake? Technically, Polars users should have a clear distinction already, since polars has a separate string column type and temporal dtypes, unlike the object dtype in pandas, which can hold pretty much anything.
- Use a sample of the records instead of scanning everything (see the sketch after this list).
- Remove trying to cast "str" to any other datatypes. Technically, an object dtype shouldn't exist in a polars DataFrame, and even if it did, our current `validate_X_object` helper functions for polars do not cover it, since polars simply has no functions for an object type. Our current helper functions only cast from string to date/float, which differs from the behaviour for a pandas dataframe.
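Here is a rough sketch of option 2. `infer_type_from_sample` and its `sample_size` parameter are hypothetical names for illustration, not part of the current locopy API; the idea is just to probe a bounded random sample rather than every row:

```python
import pandas as pd


def infer_type_from_sample(column: pd.Series, sample_size: int = 1000) -> str:
    """Guess a Snowflake type from a bounded random sample of non-null values."""
    sample = column.dropna()
    if len(sample) > sample_size:
        sample = sample.sample(n=sample_size, random_state=0)
    try:
        pd.to_datetime(sample, errors="raise")
        return "timestamp"
    except (ValueError, TypeError):
        pass
    try:
        pd.to_numeric(sample, errors="raise")
        return "float"
    except (ValueError, TypeError):
        pass
    return "varchar"
```

A sample obviously trades accuracy for speed: a column whose sampled values all happen to parse could still be mis-typed, so the sample size (or a fallback full scan on ambiguous columns) would need some care.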
Justification for option 3. Small example:
```python
import numpy as np
import pandas as pd

arr = np.array([1, "hello", 3.14, [1, 2, 3], {"a": 1}], dtype=object)
date = pd.date_range(start='1/1/2018', end='1/05/2018')
pd_df = pd.DataFrame({"mixed": arr, "date": date})
pd_df["date_str"] = pd_df["date"].dt.strftime('%Y-%m-%d').astype('string')
pd_df["date_object"] = pd_df["date"].dt.strftime('%Y-%m-%d')
pd_df["string"] = ["a", "b", "c", "d", "e"]
pd_df.dtypes
```
gives
```
mixed                 object
date          datetime64[ns]
date_str      string[python]
date_object           object
string                object
dtype: object
```
```python
import locopy
from locopy.utility import find_column_type

find_column_type(pd_df, "snowflake")
```

outputs:

```
OrderedDict([('mixed', 'varchar'), ('date', 'timestamp'), ('date_str', 'varchar'), ('date_object', 'date'), ('string', 'varchar')])
```
On the other hand, for polars (note the difference in "date_str"):
```python
import polars as pl

pl_df = pl.from_pandas(pd_df[["date", "date_str", "date_object", "string"]])
# Added explicitly: from_pandas errors out on an object-dtype column.
pl_df = pl_df.with_columns(mixed=arr)
find_column_type(pl_df, "snowflake")
```

outputs:

```
OrderedDict([('date', 'timestamp'), ('date_str', 'date'), ('date_object', 'date'), ('string', 'varchar'), ('mixed', 'varchar')])
```
polars also automatically tries to infer datatypes, e.g. `pl_df["date_object"]` is automatically converted to a string column:
shape: (5, 5)
| date | date_str | date_object | string | mixed |
|---|---|---|---|---|
| datetime[ns] | str | str | str | object |
| 2018-01-01 00:00:00 | "2018-01-01" | "2018-01-01" | "a" | 1 |
| 2018-01-02 00:00:00 | "2018-01-02" | "2018-01-02" | "b" | hello |
| 2018-01-03 00:00:00 | "2018-01-03" | "2018-01-03" | "c" | 3.14 |
| 2018-01-04 00:00:00 | "2018-01-04" | "2018-01-04" | "d" | [1, 2, 3] |
| 2018-01-05 00:00:00 | "2018-01-05" | "2018-01-05" | "e" | {'a': 1} |
Due to the discrepancies between polars and pandas, I think the best approach is to pass all string columns through as 'VARCHAR' without trying to cast them (the same treatment any pandas string column already gets).
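A minimal sketch of what that could look like on the polars side. `_snowflake_type` is a hypothetical helper, not existing locopy code, and the dtype coverage is deliberately abbreviated:

```python
import polars as pl


def _snowflake_type(dtype: pl.DataType) -> str:
    """Map a polars dtype to a Snowflake column type without value-level casts."""
    if dtype == pl.Datetime:
        return "timestamp"
    if dtype == pl.Date:
        return "date"
    if dtype in (pl.Float32, pl.Float64):
        return "float"
    if dtype in (pl.Int8, pl.Int16, pl.Int32, pl.Int64):
        return "int"
    # Strings (and anything unrecognised, e.g. object) fall through to varchar,
    # with no attempt to parse individual values as dates or numbers.
    return "varchar"


# For the pl_df above this would give e.g.
# {'date': 'timestamp', 'date_str': 'varchar', 'date_object': 'varchar', ...}
print({name: _snowflake_type(dtype) for name, dtype in pl_df.schema.items()})
```

Branching on the declared dtype alone is O(number of columns) rather than O(number of rows), which would also resolve the performance problem for large dataframes.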