Bytewax Materialization Engine Isn't Production Ready
Purpose of this issue
This issue arises from a decent amount of time spent industrializing a feature store solution on our EKS platform, where:
- Snowflake is the offline data store
- Materialization workloads run in Docker containers on Kubernetes, scheduled/orchestrated with Argo
- Online features are served out of Postgres
It appears that aspects of the Bytewax materialization engine have their origins in the basic Bytewax-on-Kubernetes example, but have not been stress-tested in a production environment against production-scale data.
This issue aims to revisit aspects of the implementation to bring Kubernetes-based materialization, enabled by Bytewax, up to an acceptable standard.
Expected Behavior
Regardless of materialization dataset size, the Bytewax materialization engine predictably and reliably delivers feature data to online data stores.
Current Behavior
These are some issues experienced on datasets of millions of rows x dozens of features. YMMV.
Basics:
- The job that the materialization generates declares an init container that does nothing
- There's minimal to no logging in the current implementation, which makes troubleshooting impossible
Executing Materialization:
- Exporting a large offline materialization dataset to an intermediary store for processing can result in hundreds of files
- The generated job assumes a parallelism of (n), where (n) is the number of staged files to process
- This leads to a job-orchestrated cluster of (n) pods starting up, which...
- Depending on the resources allocated to the job, can choke out available Kubernetes nodes or reject the provision of new cluster nodes, which in turn...
- Has a high likelihood of failing the job, meaning that the cluster never shuts down, and materialization never completes
- There are no controls around how big a number (n) can be. This concern should not necessarily be pushed back upstream to the offline store.
- The next execution of materialization compounds this issue
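To make the scaling concern concrete, here is a minimal, hypothetical sketch (not the actual Feast code; `plan_job` and the cap of 10 are invented for illustration) of deriving a Job plan where parallelism is capped instead of being set equal to the staged file count:

```python
def plan_job(num_staged_files: int, max_parallelism: int = 10) -> dict:
    """Hypothetical helper: derive a Kubernetes Job plan from the number
    of staged parquet files.

    The behaviour described above is effectively
    parallelism == num_staged_files, so hundreds of staged files mean
    hundreds of pods starting at once. Capping parallelism bounds the
    concurrent pod count while `completions` still covers every file.
    """
    return {
        "completions": num_staged_files,
        "parallelism": min(num_staged_files, max_parallelism),
    }

# 300 staged files: an uncapped plan schedules 300 pods at once,
# a capped plan runs at most 10 pods concurrently.
print(plan_job(300))  # {'completions': 300, 'parallelism': 10}
```

With a cap like this, node pressure becomes a function of configuration rather than of however many files the offline store happened to export.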
Possible Solution
Tidy up the implementation of the Bytewax materialization engine for implementations running on Kubernetes as a scalable, reliable, and sustainable solution.
Fixes include:
- Simplify the job definition to only run the materialization container
- Add configuration support for explicitly setting the job parallelism for more predictable job behaviour
- Retain the data transformation implementation
- Switch the execution mode from a cluster to a one-pod-one-file process, driven by the `JOB_COMPLETION_INDEX` environment variable set by Kubernetes Job specs having `completions` > 1
- Include a judicious amount of logging to assist in identifying and resolving isolated job failures
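As a sketch of the proposed one-pod-one-file mode: Kubernetes injects `JOB_COMPLETION_INDEX` into each pod of an indexed Job (`completionMode: Indexed`), so a worker can deterministically pick its own file. The helper name and file names below are hypothetical, not Feast code:

```python
import os

def file_for_this_pod(staged_files: list[str]) -> str:
    """Hypothetical worker entrypoint helper: select the single staged
    file this pod should materialize, using the JOB_COMPLETION_INDEX
    value that Kubernetes sets for pods of an indexed Job."""
    index = int(os.environ["JOB_COMPLETION_INDEX"])
    return staged_files[index]

# Example: pod with completion index 2 in a 4-completion indexed Job
# processes the third staged file.
os.environ["JOB_COMPLETION_INDEX"] = "2"
files = ["part-0.parquet", "part-1.parquet",
         "part-2.parquet", "part-3.parquet"]
print(file_for_this_pod(files))  # part-2.parquet
```

Because each completion index maps to exactly one file, a failed pod can be retried in isolation without restarting a whole Bytewax cluster.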
@adamschmidt have you tried the snowflake materialization engine?
cc @whoahbot
Thanks for the feedback!
There's definitely a lot of room to improve with regard to adding much better logging and exposing more user configuration, especially around data parallelism.
What do you mean by "Retain the data transformation implementation"?
Thank you for taking the time to write up your feedback.
* There's minimal to no logging in the current implementation, which makes troubleshooting impossible
Is the logging that you are referring to specific to the Bytewax materialization process, or to Feast during the materialization process?
Possible Solution
Tidy up the implementation of the Bytewax materialization engine for implementations running on Kubernetes as a scalable, reliable, and sustainable solution.
Fixes include:
* Simplify the job definition to only run the materialization container
:heavy_plus_sign: This is an artifact of the previous implementation which I should have removed when I updated the implementation.
* Add configuration support for explicitly setting the job parallelism for more predictable job behaviour
:heavy_plus_sign: In the initial PR, this value was configurable, but was changed to create one pod for each parquet file. Being able to configure the job parallelism so that a worker processes multiple files makes sense to me.
* Switch the execution mode from a cluster to a one-pod-one-file process, driven by the `JOB_COMPLETION_INDEX` environment variable set by Kubernetes Job specs having `completions` > 1
I'm not sure I follow the implementation here. Could you elaborate?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Related issue with other points to consider #3162
Bytewax Materialization Engine was retired