
What are the disadvantages when the data preprocessing and model training pipelines are integrated via RayDP?

Open · sparkinglu opened this issue 3 years ago · 1 comment

Hello

I am not sure if this is the right place to ask this question, since I could not find anywhere else to ask such questions.

I can see the advantages of integrating them together. However, if the output of data preprocessing is fed into model training via the Ray in-memory object store, the preprocessing has to be re-run whenever the model is retrained, for example after a change to the model algorithm. If I am correct, doesn't that make such repeated preprocessing a huge waste of resources and time? If the preprocessing output is persisted somewhere, we can always reuse the same persisted output to train the model repeatedly.

Besides this one, are there any other disadvantages to such an integrated pipeline? For example, is the data size limited by the object store capacity, and is pipeline reliability impacted by relying on the in-memory object store?

Thanks, Steven Lu

sparkinglu avatar Jul 31 '21 04:07 sparkinglu

Hi @sparkinglu, you can also persist the Spark-processed data to a distributed filesystem using RayDP if that better fits your needs. It depends on your use cases and your organization's strategy. For example, if data is continuously coming in and you need to repeatedly train models on the new data, it is usually good to have the integrated pipeline and use the Ray object store to exchange the data. Some organizations also choose to rerun some of the data preprocessing phases for different pipelines, because they have free CPU resources in the cluster, and this way they don't need to worry about when, and by whom, the intermediate preprocessed data should be deleted. The Ray object store supports spilling to disk, so the data size should not be limited by memory capacity.
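To make the two options concrete, here is a minimal sketch of how both integration styles might look with RayDP. This is illustrative only: the input path, column names, and executor sizing are all hypothetical, and it assumes a running Ray cluster with `raydp` installed and a RayDP/Ray version that supports `ray.data.from_spark`. The function is defined but not executed here.

```python
# Hypothetical sketch of the two hand-off options discussed above:
#   1) persist the preprocessed data to a distributed filesystem so
#      retraining does not re-run preprocessing, or
#   2) exchange it through the Ray object store (which spills to disk).
# Requires a Ray cluster with raydp installed; imports are kept inside
# the function so this file can be read without those dependencies.

def preprocess(persist_path=None):
    import ray
    import raydp

    ray.init(address="auto")  # connect to an existing Ray cluster
    spark = raydp.init_spark(
        app_name="preprocess",   # hypothetical app name
        num_executors=2,
        executor_cores=2,
        executor_memory="4GB",
    )

    # Hypothetical input path and feature columns.
    df = spark.read.parquet("s3://bucket/raw/")
    features = df.select("label", "feature1", "feature2")

    if persist_path:
        # Option 1: persist, so the model can be retrained later
        # from the same materialized output.
        features.write.mode("overwrite").parquet(persist_path)
        result = persist_path
    else:
        # Option 2: move the Spark DataFrame into the Ray object
        # store as a Ray Dataset for the training step.
        result = ray.data.from_spark(features)

    return result
```

A training job would then either read `persist_path` back (option 1) or consume the returned Ray Dataset directly (option 2); with option 2, keep the Spark session alive until the data has been consumed.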

carsonwang avatar Aug 02 '21 08:08 carsonwang

Closing as stale.

kira-lin avatar Apr 14 '23 08:04 kira-lin