FR: Configurable offline store export file sizes + materialization parallelism when using Bytewax materialization engine

Open adchia opened this issue 2 years ago

Is your feature request related to a problem? Please describe.

Right now, the batch materialization engines (including Bytewax) leverage the RetrievalJob.to_remote_storage method to unload feature values to blob storage (GCS/S3/Azure) as parquet files, and then spin up one worker per parquet file to materialize. This can be problematic because each offline store has its own default size for the exported parquet files, and that size can be too small or too large to take full advantage of a parallelized materialization engine (a simplified sketch of this pattern follows the list below).

e.g.

  • BigQuery defers to a wildcard (glob) export URI, which makes no guarantees but roughly splits the output into 1 GB chunks.
  • Snowflake’s current unload query doesn’t set anything special, so it should default to 16 MB files (https://docs.snowflake.com/en/user-guide/data-unload-considerations.html#unloading-to-a-single-file). It does expose a MAX_FILE_SIZE option to set that file size, though.
  • Redshift also exposes a way to set a max file size (MAXFILESIZE), but we don’t set it, so it defaults to 6.2 GB.

It seems like the to_remote_storage method should take in a max file size, which the materialization engine could configure and pass into the unload queries; for BigQuery, though, this doesn’t work.
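For context, here is a minimal sketch (not the actual Bytewax engine code) of the pattern described above: the engine takes whatever parquet files the offline store produces and keys its parallelism off the file count. spawn_worker below is a hypothetical stand-in for launching the per-file materialization workers.

```python
from typing import List

from feast.infra.offline_stores.offline_store import RetrievalJob


def spawn_worker(parquet_uri: str) -> None:
    """Hypothetical stand-in for launching one materialization worker/pod per file."""
    print(f"materializing {parquet_uri}")


def materialize_in_parallel(retrieval_job: RetrievalJob) -> None:
    # to_remote_storage() unloads the query result to blob storage as parquet
    # files and returns their URIs; today it takes no sizing arguments.
    parquet_uris: List[str] = retrieval_job.to_remote_storage()

    # One worker per exported file: if the store emits a few huge files (or
    # thousands of tiny ones), parallelism is a poor match for the data volume.
    for uri in parquet_uris:
        spawn_worker(uri)
```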

Describe the solution you'd like

Ideally, users would be able to configure the parameters that control materialization parallelism and the max file size used when writing sharded files to remote storage. Two places this could make sense (sketched after this list):

  • In the OfflineStoreConfig classes, to specify a default max file size (which works for Snowflake and Redshift)
  • As an argument to to_remote_storage, so that users have some control over how their training dataset is sharded. This can be useful when feeding the output into distributed training.
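A minimal sketch of what those two knobs could look like, assuming a new max_file_size_mb field on the store config and an optional override on to_remote_storage; the names, defaults, and class shapes below are illustrative, not the actual Feast API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SnowflakeOfflineStoreConfig:  # illustrative subset, not the real config class
    # Hypothetical store-level default. Snowflake's COPY INTO <location> accepts a
    # MAX_FILE_SIZE copy option and Redshift UNLOAD accepts MAXFILESIZE, so the
    # store could translate this value into its unload query.
    max_file_size_mb: int = 512


class RetrievalJob:  # illustrative subset of the existing interface
    def to_remote_storage(self, max_file_size_mb: Optional[int] = None) -> List[str]:
        """Unload the query result to blob storage as sharded parquet files.

        A per-call max_file_size_mb would override the store-level default and be
        passed through to the store-specific unload option; BigQuery has no
        equivalent knob, so it would be ignored there.
        """
        raise NotImplementedError
```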

It might also be worth exposing other Bytewax materialization engine parameters, such as BYTEWAX_WORKERS_PER_PROCESS or BYTEWAX_REPLICAS (i.e. some way to configure parallel materialization from a single parquet file).
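One possible shape for that, assuming the engine config gains fields that are forwarded as environment variables to the materialization job's pods; the class and field names here are assumptions, not the current Bytewax engine config:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BytewaxParallelismConfig:  # hypothetical addition to the Bytewax engine config
    bytewax_replicas: int = 1             # forwarded as BYTEWAX_REPLICAS
    bytewax_workers_per_process: int = 1  # forwarded as BYTEWAX_WORKERS_PER_PROCESS

    def as_env(self) -> Dict[str, str]:
        # The engine would inject these into the materialization job's pod spec so
        # that a single parquet file can be processed by multiple workers.
        return {
            "BYTEWAX_REPLICAS": str(self.bytewax_replicas),
            "BYTEWAX_WORKERS_PER_PROCESS": str(self.bytewax_workers_per_process),
        }
```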

Describe alternatives you've considered

Rely more heavily on the inputs to feast materialize, which can already split work by feature view and by time interval.
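For example, coarse-grained control is already possible by issuing several narrower materialize calls from the Python SDK (the feature view names and dates below are illustrative):

```python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Materialize one feature view at a time over a bounded window, so each call
# (and the engine run behind it) handles a smaller slice of the data.
for view in ["driver_hourly_stats", "customer_daily_profile"]:
    store.materialize(
        start_date=datetime(2022, 8, 30),
        end_date=datetime(2022, 8, 31),
        feature_views=[view],
    )
```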

adchia avatar Aug 31 '22 21:08 adchia

cc @achals @whoahbot @sfc-gh-madkins

adchia avatar Aug 31 '22 21:08 adchia

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 07 '23 13:01 stale[bot]

bump

sfc-gh-madkins avatar Jan 07 '23 16:01 sfc-gh-madkins

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 20 '23 16:05 stale[bot]

bump

sfc-gh-madkins avatar May 20 '23 16:05 sfc-gh-madkins

This request would be very helpful. At the very least, exposing BYTEWAX_WORKERS_PER_PROCESS or BYTEWAX_REPLICAS should be easily achievable?

RicardoHS avatar Dec 22 '23 12:12 RicardoHS