feast
Unclear error is printed when wrong event_timestamp column type is used
When running feast materialize-incremental 2022-01-01T00:00:00
on a parquet source that contains a string-based event_timestamp
column, the following exception is printed:
Materializing 1 feature views to 2022-01-01 00:00:00-08:00 into the sqlite online store.
fake_data_fv from 2021-05-21 02:11:51-07:00 to 2022-01-01 00:00:00-08:00:
Traceback (most recent call last):
File "/home/willem/.pyenv/versions/3.7.7/bin/feast", line 8, in <module>
sys.exit(cli())
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/cli.py", line 270, in materialize_incremental_command
end_date=datetime.fromisoformat(end_ts),
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/telemetry.py", line 151, in exception_logging_wrapper
result = func(*args, **kwargs)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/feature_store.py", line 379, in materialize_incremental
tqdm_builder,
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/local.py", line 193, in materialize_single_feature_view
end_date=end_date,
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in pull_latest_from_table_or_query
lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in <lambda>
lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
AttributeError: 'str' object has no attribute 'tzinfo'
Instead, we should validate types during materialize and print a clearer error message.
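A minimal sketch of what such a check could look like, assuming validation happens per value at the point where the lambda in the traceback runs. The helper name `to_utc` and the error wording are hypothetical and are not part of Feast; the standard library's `timezone.utc` stands in for `pytz.utc` to keep the example self-contained.

```python
from datetime import datetime, timezone


def to_utc(value, column="event_timestamp"):
    """Return a timezone-aware datetime, or raise a descriptive TypeError.

    Hypothetical helper for illustration only; not Feast's actual code.
    """
    if not isinstance(value, datetime):
        # Fail early and name the offending column and type, instead of
        # surfacing "'str' object has no attribute 'tzinfo'".
        raise TypeError(
            f"Column '{column}' must contain timestamps, but found a value of "
            f"type '{type(value).__name__}': {value!r}. Convert the column to "
            "a timestamp type before materializing."
        )
    # Same rule as the lambda in the traceback: naive values are treated as UTC.
    return value if value.tzinfo is not None else value.replace(tzinfo=timezone.utc)
```

Because `pandas.Timestamp` is a subclass of `datetime`, the same check would accept values from a properly typed timestamp column and reject plain strings.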
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Similar error when processing the timestamp column: https://github.com/feast-dev/feast/issues/2301
@woop do you know of a workaround for this issue? It's a stale issue, but the same problem exists even in version 0.19.4 =/
How to fix this error?
@sgvarsh the workaround that I found:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import to_timestamp
conf = SparkConf().setMaster(SPARK_MASTER)
# FEAST does not work with INT96 (this is the default type using pyspark
# to write parquet files containing timestamp fields,
# another option is to use string based timestamps, but...)
# https://issues.apache.org/jira/browse/PARQUET-323
# https://stackoverflow.com/questions/56582539/how-to-save-spark-dataframe-to-parquet-without-using-int96-format-for-timestamp
# FEAST works with TIMESTAMP_MICROS (I did not try TIMESTAMP_MILLIS)
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark_context = SparkContext(conf=conf)
sql_context = SQLContext(spark_context)
# header=True so that the event_timestamp column can be referenced by name
df = sql_context.read.csv(path, header=True)
df = df.withColumn("event_timestamp", to_timestamp(df.event_timestamp, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ"))
## FEAST cannot read a directory with .parquet files
df.coalesce(1).write.mode("overwrite").parquet('output.parquet')
Inspecting the file output.parquet:
############ Column(event_timestamp) ############
name: event_timestamp
path: event_timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
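For sources prepared with pandas rather than Spark, the same idea (convert the string column to a real timestamp type before writing the file) might be sketched as follows. The helper name `ensure_utc_timestamps` and the sample data are hypothetical, and this has not been verified against Feast itself.

```python
import pandas as pd


def ensure_utc_timestamps(df, column="event_timestamp"):
    """Return a copy of df whose `column` holds timezone-aware UTC timestamps."""
    out = df.copy()
    # utc=True parses ISO 8601 strings and normalizes every value to UTC.
    out[column] = pd.to_datetime(out[column], utc=True)
    return out


# Made-up sample rows with string-based timestamps.
raw = pd.DataFrame(
    {
        "feast_id": ["a", "b"],
        "event_timestamp": [
            "2021-05-21T02:11:51-07:00",
            "2021-05-22T00:00:00+00:00",
        ],
    }
)
fixed = ensure_utc_timestamps(raw)
print(fixed.dtypes)
```

The converted frame could then be written with `fixed.to_parquet("output.parquet")`, which requires a parquet engine such as pyarrow to be installed.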
Reading the feature view:
training_df = fs.get_historical_features(
entity_df=entity_df,
features=[
"feature_view:***",
"feature_view:***",
"feature_view:***",
],
).to_df()
print("----- Feature schema -----\n")
print(training_df.info())
print()
print("----- Example features -----\n")
print(training_df.head(8))
----- Feature schema -----
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
Column Non-Null Count Dtype
--- ------ -------------- -----
0 feast_id 5 non-null object
1 event_timestamp 0 non-null datetime64[ns, UTC]
2 *** 5 non-null object
3 *** 5 non-null object
4 *** 5 non-null object
dtypes: datetime64[ns, UTC](1), object(4)
memory usage: 240.0+ bytes
----- Example features -----
feast_id ... ***
0 12f8cbcf-286a-44f6-a84d-e6d9a8fe902a ... ***
1 c47e2260-87eb-4748-b63f-cfda3c7fd258 ... ***
2 7e835362-4ed8-41ed-b81d-7591b38c151d ... ***
3 24fa1717-5e92-4a57-bd19-0b3e851ea357 ... ***
4 8ce9e852-3a4d-4e96-95dc-fa809481c08a ... ***
[5 rows x 5 columns]