find_column_type takes too long for large datasets
Attempting to parse strings to timestamps takes very long when the dataframe has many rows and the column is a genuine string column that is not in any timestamp format.
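For context, a rough illustration of the cost. This is only a sketch of the failure mode, not the exact locopy code path; I'm assuming the cast is attempted across the whole column (e.g. with `errors="coerce"`), so every row gets scanned:

```python
import time

import pandas as pd

# One million plain strings that are definitely not timestamps.
big = pd.Series(["definitely not a date"] * 1_000_000, dtype="string")

start = time.perf_counter()
# Every row goes through date parsing before the column is given up on.
pd.to_datetime(big, errors="coerce")
elapsed = time.perf_counter() - start
print(f"spent {elapsed:.2f}s probing a column that was never a timestamp")
```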
Possible solutions:
- Could we leave it to the user to parse timestamps explicitly in their original dataframe before uploading to Snowflake? Technically, Polars users should have a clear distinction already, since polars has a separate string column type and temporal dtypes, unlike the object dtype in pandas, which can hold pretty much anything.
- Use a sample of the records instead of scanning everything (see the sketch after this list).
- Remove trying to cast "str" to any other datatypes. Technically, an object dtype shouldn't exist in a polars DataFrame, and even if it did, our current `validate_X_object` helper functions for polars do not cover it, since polars simply has no functions for an object type. Our current helper functions only cast from string to date/float, which differs from the behaviour for a pandas dataframe.
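Here is a rough sketch of option 2. `infer_type_from_sample` and its `sample_size` parameter are hypothetical names for illustration, not part of the current locopy API; the idea is just to probe a bounded random sample rather than every row:

```python
import pandas as pd


def infer_type_from_sample(column: pd.Series, sample_size: int = 1000) -> str:
    """Guess a Snowflake type from a bounded random sample of non-null values."""
    sample = column.dropna()
    if len(sample) > sample_size:
        sample = sample.sample(n=sample_size, random_state=0)
    try:
        pd.to_datetime(sample, errors="raise")
        return "timestamp"
    except (ValueError, TypeError):
        pass
    try:
        pd.to_numeric(sample, errors="raise")
        return "float"
    except (ValueError, TypeError):
        pass
    return "varchar"
```

A sample obviously trades accuracy for speed: a column whose sampled values all happen to parse could still be mis-typed, so the sample size (or a fallback full scan on ambiguous columns) would need some care.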
Justification for option 3. Small example:
```python
import numpy as np
import pandas as pd

arr = np.array([1, "hello", 3.14, [1, 2, 3], {"a": 1}], dtype=object)
date = pd.date_range(start='1/1/2018', end='1/05/2018')
pd_df = pd.DataFrame({"mixed": arr, "date": date})
pd_df["date_str"] = pd_df["date"].dt.strftime('%Y-%m-%d').astype('string')
pd_df["date_object"] = pd_df["date"].dt.strftime('%Y-%m-%d')
pd_df["string"] = ["a", "b", "c", "d", "e"]
pd_df.dtypes
```
gives
```
mixed                 object
date          datetime64[ns]
date_str      string[python]
date_object           object
string                object
dtype: object
```
```python
import locopy
from locopy.utility import find_column_type

find_column_type(pd_df, "snowflake")
```

outputs:

```
OrderedDict([('mixed', 'varchar'), ('date', 'timestamp'), ('date_str', 'varchar'), ('date_object', 'date'), ('string', 'varchar')])
```
On the other hand, for polars (note the difference in "date_str"):
```python
import polars as pl

pl_df = pl.from_pandas(pd_df[["date", "date_str", "date_object", "string"]])
# Added explicitly: from_pandas errors out on an object-dtype column.
pl_df = pl_df.with_columns(mixed=arr)
find_column_type(pl_df, "snowflake")
```

outputs:

```
OrderedDict([('date', 'timestamp'), ('date_str', 'date'), ('date_object', 'date'), ('string', 'varchar'), ('mixed', 'varchar')])
```
polars also automatically tries to infer datatypes, e.g. `pl_df["date_object"]` is automatically converted to a string column:
shape: (5, 5)
| date | date_str | date_object | string | mixed |
|---|---|---|---|---|
| datetime[ns] | str | str | str | object |
| 2018-01-01 00:00:00 | "2018-01-01" | "2018-01-01" | "a" | 1 |
| 2018-01-02 00:00:00 | "2018-01-02" | "2018-01-02" | "b" | hello |
| 2018-01-03 00:00:00 | "2018-01-03" | "2018-01-03" | "c" | 3.14 |
| 2018-01-04 00:00:00 | "2018-01-04" | "2018-01-04" | "d" | [1, 2, 3] |
| 2018-01-05 00:00:00 | "2018-01-05" | "2018-01-05" | "e" | {'a': 1} |
Due to the discrepancies between polars and pandas, I think the best approach is to pass all string columns through as 'VARCHAR' without trying to cast them (the same treatment any pandas string column already gets).
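A minimal sketch of what that could look like on the polars side. `_snowflake_type` is a hypothetical helper, not existing locopy code, and the dtype coverage is deliberately abbreviated:

```python
import polars as pl


def _snowflake_type(dtype: pl.DataType) -> str:
    """Map a polars dtype to a Snowflake column type without value-level casts."""
    if dtype == pl.Datetime:
        return "timestamp"
    if dtype == pl.Date:
        return "date"
    if dtype in (pl.Float32, pl.Float64):
        return "float"
    if dtype in (pl.Int8, pl.Int16, pl.Int32, pl.Int64):
        return "int"
    # Strings (and anything unrecognised, e.g. object) fall through to varchar,
    # with no attempt to parse individual values as dates or numbers.
    return "varchar"


# For the pl_df above this would give e.g.
# {'date': 'timestamp', 'date_str': 'varchar', 'date_object': 'varchar', ...}
print({name: _snowflake_type(dtype) for name, dtype in pl_df.schema.items()})
```

Branching on the declared dtype alone is O(number of columns) rather than O(number of rows), which would also resolve the performance problem for large dataframes.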