FR: Configurable offline store export file sizes + materialization parallelism when using Bytewax materialization engine
Is your feature request related to a problem? Please describe.
Right now, the batch materialization engines (including Bytewax) leverage the `RetrievalJob#to_remote_storage` method to unload feature values to blob storage (GCS/S3/Azure) as parquet files, and then spin up one worker per parquet file to materialize them. This can be problematic because each offline store has its own default size for the exported parquet files, and that size can be too small or too large to fully take advantage of a parallelized materialization engine.
e.g.
- BigQuery will defer to a wildcard (glob) export, which makes no guarantees but roughly splits the output into 1 GB chunks.
- Snowflake’s current query doesn’t set anything special, so it presumably defaults to 16 MB files (https://docs.snowflake.com/en/user-guide/data-unload-considerations.html#unloading-to-a-single-file). It does expose a MAX_FILE_SIZE option to control that, though.
- Redshift exposes a way to set a max file size too (MAXFILESIZE), but we don’t set it, so it defaults to 6.2 GB.

It seems like the `to_remote_storage` method should take a max file size, which the materialization engine could configure and pass into the unload queries (although this doesn’t work for BigQuery).
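For illustration, here is a minimal sketch of how a file-size hint could be threaded into the unload queries. The helper functions and the `max_file_size_mb` parameter are assumptions for discussion, not existing Feast API; `MAX_FILE_SIZE` (Snowflake COPY INTO) and `MAXFILESIZE` (Redshift UNLOAD) are the real store-side options mentioned above.

```python
from typing import Optional


def snowflake_unload_sql(select_sql: str, stage_uri: str, max_file_size_mb: Optional[int] = None) -> str:
    """Build a Snowflake COPY INTO <location> statement for parquet export.

    MAX_FILE_SIZE is specified in bytes and defaults to ~16 MB when omitted.
    """
    size_clause = f"MAX_FILE_SIZE = {max_file_size_mb * 1024 * 1024}" if max_file_size_mb else ""
    return (
        f"COPY INTO '{stage_uri}' FROM ({select_sql}) "
        f"FILE_FORMAT = (TYPE = PARQUET) {size_clause}"
    )


def redshift_unload_sql(select_sql: str, s3_uri: str, iam_role: str, max_file_size_mb: Optional[int] = None) -> str:
    """Build a Redshift UNLOAD statement for parquet export.

    MAXFILESIZE defaults to 6.2 GB when omitted.
    """
    size_clause = f"MAXFILESIZE {max_file_size_mb} MB" if max_file_size_mb else ""
    return (
        f"UNLOAD ('{select_sql}') TO '{s3_uri}' IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET {size_clause}"
    )
```

A `to_remote_storage(max_file_size_mb=...)` overload could build its unload query this way and still return the list of exported file URIs as it does today.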
Describe the solution you'd like
Ideally, users would have the ability to configure parameters here that affect materialization parallelism and the max file size when outputting sharded files to remote storage. Two places this could make sense:
- In the `OfflineStoreConfig` files, to specify a default max file size (which works for Snowflake and Redshift); a sketch follows this list.
- In `to_remote_storage`, so that users have some control over how their training dataset is sharded. This can be useful when feeding the output into distributed training.
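As a rough sketch of the first option (the field name `max_export_file_size_mb` is hypothetical, and only the relevant part of the pydantic config model is shown):

```python
from typing import Optional

from pydantic import BaseModel  # Feast offline store configs are pydantic models


class SnowflakeOfflineStoreConfig(BaseModel):
    type: str = "snowflake.offline"
    # ... existing connection fields (account, user, role, warehouse, database, ...) omitted

    # Hypothetical: default cap on each parquet file written by to_remote_storage,
    # translated to MAX_FILE_SIZE (Snowflake) / MAXFILESIZE (Redshift) in the unload query.
    max_export_file_size_mb: Optional[int] = None
```

The same field could go on the Redshift config, and a `to_remote_storage(max_file_size_mb=...)` argument could override it per call.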
It might also be worth exposing other Bytewax materialization engine parameters such as `BYTEWAX_WORKERS_PER_PROCESS` or `BYTEWAX_REPLICAS` (i.e. some way to configure parallel materialization from a single parquet file).
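A sketch of what exposing those knobs through the engine config could look like (the `replicas` and `workers_per_process` fields are hypothetical, and the class is only an illustrative stand-in for the Bytewax engine's config model):

```python
from pydantic import BaseModel


class BytewaxMaterializationEngineConfig(BaseModel):
    """Illustrative subset of the Bytewax engine config; existing fields are omitted."""

    type: str = "bytewax"

    # Hypothetical knobs mirroring BYTEWAX_REPLICAS / BYTEWAX_WORKERS_PER_PROCESS,
    # so dataflow parallelism can be set from feature_store.yaml instead of raw env vars.
    replicas: int = 1
    workers_per_process: int = 1
```

The engine could then pass these values through to the materialization job it spins up, rather than relying on environment variables alone.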
Describe alternatives you've considered
Rely more heavily on the input into `feast materialize`, which can already split work by feature view and by time interval; for example:
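Here is a sketch of that workaround via the Python SDK (the feature view name and dates are placeholders):

```python
from datetime import datetime, timedelta

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Split a week-long backfill into daily windows for one feature view, so each
# materialization run (and the parquet files it exports) stays small.
start = datetime(2022, 9, 1)
for day in range(7):
    window_start = start + timedelta(days=day)
    store.materialize(
        start_date=window_start,
        end_date=window_start + timedelta(days=1),
        feature_views=["driver_hourly_stats"],  # placeholder feature view name
    )
```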
Additional context
cc @achals @whoahbot @sfc-gh-madkins
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
bump
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
bump
This request would be very helpful. At least exposing `BYTEWAX_WORKERS_PER_PROCESS` or `BYTEWAX_REPLICAS` could be easily achievable?