
Unclear error is printed when wrong event_timestamp column type is used

Open woop opened this issue 3 years ago • 5 comments

When running feast materialize-incremental 2022-01-01T00:00:00 on a parquet source that contains a string-based event_timestamp column, the following exception is printed:

Materializing 1 feature views to 2022-01-01 00:00:00-08:00 into the sqlite online store.

fake_data_fv from 2021-05-21 02:11:51-07:00 to 2022-01-01 00:00:00-08:00:
Traceback (most recent call last):
  File "/home/willem/.pyenv/versions/3.7.7/bin/feast", line 8, in <module>
    sys.exit(cli())
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/cli.py", line 270, in materialize_incremental_command
    end_date=datetime.fromisoformat(end_ts),
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/telemetry.py", line 151, in exception_logging_wrapper
    result = func(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/feature_store.py", line 379, in materialize_incremental
    tqdm_builder,
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/local.py", line 193, in materialize_single_feature_view
    end_date=end_date,
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in pull_latest_from_table_or_query
    lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in <lambda>
    lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
AttributeError: 'str' object has no attribute 'tzinfo'

Instead, we should validate types during materialize and print a clearer error message.
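
One way such a check could look, as a minimal sketch in plain Python rather than Feast's actual implementation: it replaces the bare lambda from the traceback with a helper that names the offending column and type. The function name `ensure_utc` and the error wording are hypothetical, and `timezone.utc` from the standard library stands in for `pytz.utc`.

```python
from datetime import datetime, timezone


def ensure_utc(value, column="event_timestamp"):
    """Return `value` as a timezone-aware datetime, assuming UTC when naive.

    Hypothetical helper, not part of Feast: it mirrors the lambda in the
    traceback but fails with a descriptive message on non-datetime values.
    """
    if not isinstance(value, datetime):
        raise TypeError(
            f"Column '{column}' must contain datetime values, "
            f"but found {type(value).__name__}: {value!r}. "
            "Convert the column to a timestamp type before materializing."
        )
    # Same rule as the original lambda: keep an existing timezone, else assume UTC.
    return value if value.tzinfo is not None else value.replace(tzinfo=timezone.utc)
```

Applied to a string such as "2021-05-21T02:11:51", this raises a TypeError that names the event_timestamp column instead of the AttributeError shown in the traceback.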

woop May 22 '21 02:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] Sep 19 '21 06:09

Similar error when processing the timestamp column: https://github.com/feast-dev/feast/issues/2301

fcas Apr 12 '22 13:04

@woop do you know of a workaround for this issue? It's a stale issue, but the same problem exists even in version 0.19.4 =/

fcas Apr 12 '22 18:04

How to fix this error?

sgvarsh Apr 12 '22 20:04

@sgvarsh the workaround that I found:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import to_timestamp

# SPARK_MASTER and path are assumed to be defined elsewhere.
conf = SparkConf().setMaster(SPARK_MASTER)
# Feast does not work with INT96, the default type pyspark uses when writing
# timestamp fields to parquet files. (Another option is string-based
# timestamps, but...)
# https://issues.apache.org/jira/browse/PARQUET-323
# https://stackoverflow.com/questions/56582539/how-to-save-spark-dataframe-to-parquet-without-using-int96-format-for-timestamp
# Feast works with TIMESTAMP_MICROS (I did not try TIMESTAMP_MILLIS).
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark_context = SparkContext(conf=conf)
sql_context = SQLContext(spark_context)
# header=True is assumed here so that event_timestamp can be referenced by name.
df = sql_context.read.csv(path, header=True)
df = df.withColumn("event_timestamp", to_timestamp(df.event_timestamp, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ"))
# Feast cannot read a directory with .parquet files, so write a single partition.
df.coalesce(1).write.mode("overwrite").parquet('output.parquet')

Inspecting the file output.parquet:

############ Column(event_timestamp) ############
name: event_timestamp
path: event_timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

Reading the feature view:

training_df = fs.get_historical_features(
        entity_df=entity_df,
        features=[
            "feature_view:***",
            "feature_view:***",
            "feature_view:***",
        ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head(8))
----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column                 Non-Null Count        Dtype              
---  ------                 --------------        -----              
 0   feast_id                     5 non-null      object             
 1   event_timestamp              0 non-null      datetime64[ns, UTC]
 2   ***                          5 non-null      object             
 3   ***                          5 non-null      object             
 4   ***                          5 non-null      object             
dtypes: datetime64[ns, UTC](1), object(4)
memory usage: 240.0+ bytes

----- Example features -----

   feast_id                              ...      ***
0  12f8cbcf-286a-44f6-a84d-e6d9a8fe902a  ...      ***
1  c47e2260-87eb-4748-b63f-cfda3c7fd258  ...      ***
2  7e835362-4ed8-41ed-b81d-7591b38c151d  ...      ***
3  24fa1717-5e92-4a57-bd19-0b3e851ea357  ...      ***
4  8ce9e852-3a4d-4e96-95dc-fa809481c08a  ...      ***

[5 rows x 5 columns]
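
For setups without Spark, the same string-to-timestamp conversion can be sketched with pandas. This is an illustration under stated assumptions and was not tested against Feast in this thread: the sample rows and the helper name `with_parsed_timestamps` are hypothetical, and writing the file with `to_parquet` needs a Parquet engine such as pyarrow installed.

```python
import pandas as pd


def with_parsed_timestamps(df, column="event_timestamp"):
    """Return a copy of df whose timestamp column is parsed to UTC datetimes."""
    out = df.copy()
    # utc=True produces a timezone-aware column and converts any offsets to UTC.
    out[column] = pd.to_datetime(out[column], utc=True)
    return out


# Hypothetical sample data with string timestamps, as in the failing case.
raw = pd.DataFrame(
    {
        "feast_id": ["a", "b"],
        "event_timestamp": [
            "2021-05-21T02:11:51.000000+0000",
            "2021-05-22T10:00:00.000000+0000",
        ],
    }
)
fixed = with_parsed_timestamps(raw)

# Writing requires a Parquet engine; uncomment to produce the file.
# fixed.to_parquet("output.parquet", index=False)
```

The helper returns a copy, so the original string column stays available for comparison while debugging.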

fcas Apr 13 '22 15:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] Dec 20 '22 23:12