Bytewax Materialization Engine Isn't Production Ready
Purpose of this issue
This issue arises from a decent amount of time spent industrializing a feature store solution on our EKS platform, where:
- Snowflake is the offline data store
- Materialization workloads run in Docker containers on Kubernetes, scheduled/orchestrated with Argo
- Online features are served out of Postgres
It appears that aspects of the Bytewax materialization engine have their origins in the basic Bytewax-on-Kubernetes example, but have not been stress-tested in a production environment against production-scale data.
This issue aims to revisit aspects of the implementation to bring Kubernetes-based materialization, enabled by Bytewax, up to an acceptable standard.
Expected Behavior
Regardless of materialization dataset size, the Bytewax materialization engine predictably and reliably delivers feature data to online data stores.
Current Behavior
These are some issues experienced on datasets of millions of rows x dozens of features. YMMV.
Basics:
- The job that the materialization generates declares an init container that does nothing
- There's minimal to no logging in the current implementation, which makes troubleshooting impossible
Executing Materialization:
- Exporting a large offline materialization dataset to an intermediary store for processing can result in hundreds of files
- The generated job assumes a parallelism of (n), where (n) is the number of staged files to process
- This leads to a job-orchestrated cluster of (n) pods starting up, which...
- Depending on the resources allocated to the job, can choke out available Kubernetes nodes or reject the provision of new cluster nodes, which in turn...
- Has a high likelihood of failing the job, meaning that the cluster never shuts down, and materialization never completes
- There are no controls around how big a number (n) can be. This concern should not necessarily be pushed back upstream to the offline store.
- The next execution of materialization compounds this issue
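To make the scaling concern concrete, here is a minimal, hypothetical sketch (not the actual Feast code; `plan_job` and the cap of 10 are invented for illustration) of deriving a Job plan where parallelism is capped instead of being set equal to the staged file count:

```python
def plan_job(num_staged_files: int, max_parallelism: int = 10) -> dict:
    """Hypothetical helper: derive a Kubernetes Job plan from the number
    of staged parquet files.

    The behaviour described above is effectively
    parallelism == num_staged_files, so hundreds of staged files mean
    hundreds of pods starting at once. Capping parallelism bounds the
    concurrent pod count while `completions` still covers every file.
    """
    return {
        "completions": num_staged_files,
        "parallelism": min(num_staged_files, max_parallelism),
    }

# 300 staged files: an uncapped plan schedules 300 pods at once,
# a capped plan runs at most 10 pods concurrently.
print(plan_job(300))  # {'completions': 300, 'parallelism': 10}
```

With a cap like this, node pressure becomes a function of configuration rather than of however many files the offline store happened to export.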
Possible Solution
Tidy up the implementation of the Bytewax materialization engine for implementations running on Kubernetes as a scalable, reliable, and sustainable solution.
Fixes include:
- Simplify the job definition to only run the materialization container
- Add configuration support for explicitly setting the job parallelism for more predictable job behaviour
- Retain the data transformation implementation
- Switch the execution mode from a cluster to a one-pod-one-file process, driven by the `JOB_COMPLETION_INDEX` environment variable set by Kubernetes Job specs having `completions` > 1
- Include a judicious amount of logging to assist in identifying and resolving isolated job failures
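As a sketch of the proposed one-pod-one-file mode: Kubernetes injects `JOB_COMPLETION_INDEX` into each pod of an indexed Job (`completionMode: Indexed`), so a worker can deterministically pick its own file. The helper name and file names below are hypothetical, not Feast code:

```python
import os

def file_for_this_pod(staged_files: list[str]) -> str:
    """Hypothetical worker entrypoint helper: select the single staged
    file this pod should materialize, using the JOB_COMPLETION_INDEX
    value that Kubernetes sets for pods of an indexed Job."""
    index = int(os.environ["JOB_COMPLETION_INDEX"])
    return staged_files[index]

# Example: pod with completion index 2 in a 4-completion indexed Job
# processes the third staged file.
os.environ["JOB_COMPLETION_INDEX"] = "2"
files = ["part-0.parquet", "part-1.parquet",
         "part-2.parquet", "part-3.parquet"]
print(file_for_this_pod(files))  # part-2.parquet
```

Because each completion index maps to exactly one file, a failed pod can be retried in isolation without restarting a whole Bytewax cluster.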
@adamschmidt have you tried the snowflake materialization engine?
cc @whoahbot
Thanks for the feedback!
There's definitely a lot of room to improve with regard to adding much better logging and exposing more user configuration, especially around data parallelism.
What do you mean by "Retain the data transformation implementation"?
Thank you for taking the time to write up your feedback.
* There's minimal to no logging in the current implementation, which makes troubleshooting impossible
Is the logging that you are referring to specific to the Bytewax materialization process, or to Feast during the materialization process?
Possible Solution
Tidy up the implementation of the Bytewax materialization engine for implementations running on Kubernetes as a scalable, reliable, and sustainable solution.
Fixes include:
* Simplify the job definition to only run the materialization container
:heavy_plus_sign: This is an artifact of the previous implementation which I should have removed when I updated the implementation.
* Add configuration support for explicitly setting the job parallelism for more predictable job behaviour
:heavy_plus_sign: In the initial PR, this value was configurable, but was changed to create one pod for each parquet file. Being able to configure the job parallelism so that a worker processes multiple files makes sense to me.
* Switch the execution mode from a cluster to a one-pod-one-file process, driven by the `JOB_COMPLETION_INDEX` environment variable set by Kubernetes Job specs having `completions` > 1
I'm not sure I follow the implementation here. Could you elaborate?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Related issue with other points to consider #3162
Bytewax Materialization Engine was retired