raydp
How to store other types of data in the ObjectStore in a distributed way?
I used raydp to read the data, and after GNN training the output is in tensor format. Now I want to store it in the ObjectStore in a distributed way, so that I can later retrieve it and combine it with other feature data for logistic regression (Distributed Scikit-learn / Joblib). However, create_ml_dataset_from_spark() only accepts data of type sql.DataFrame. Do I need to convert the data? Could you provide methods to store other data types in the future?
Hi @YeahNew, it seems this is no longer related to Spark. If your data is not in Spark, you can probably use Ray's API directly instead of the RayDP APIs. If you still want to create an MLDataset, you can create a Ray parallel iterator and use the Ray MLDataset API from_parallel_it. By the way, we are collecting real RayDP use cases, and your workload sounds very interesting. Are you using RayDP and other frameworks for a real use case in your company?
OK, I get it. Yes. It seems that the Ray APIs do not provide a method that can directly store data in the ObjectStore in a distributed manner.
Closing as stale. Putting data into the Ray object store ensures that you can fetch it from any node of the Ray cluster. If the data is already distributed when you put it, it stays distributed. If it is not, you do not need to do anything extra, because it will be fetched to the node where you use it when you call ray.get.
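For anyone landing here later, a minimal sketch of the put/get pattern described above might look like the following. It is an illustration rather than anything from RayDP itself: the helper names `split_into_shards`, `put_shards`, and `fetch_all` are hypothetical, the rows are plain Python lists standing in for tensors, and only `ray.init`, `ray.is_initialized`, `ray.put`, and `ray.get` are actual Ray calls. Ray is imported lazily inside the functions so the sharding helper can be used on its own.

```python
from typing import Any, List, Sequence


def split_into_shards(rows: Sequence[Any], num_shards: int) -> List[List[Any]]:
    """Split rows into num_shards contiguous chunks of near-equal size."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        end = start + base + (1 if i < extra else 0)
        shards.append(list(rows[start:end]))
        start = end
    return shards


def put_shards(shards: List[List[Any]]) -> list:
    """Store each shard in the Ray object store; returns one ObjectRef per shard."""
    import ray  # lazy import: assumes Ray is installed where this is called

    if not ray.is_initialized():
        ray.init()
    # Called from the driver, ray.put stores the objects on the driver's node.
    return [ray.put(shard) for shard in shards]


def fetch_all(refs: list) -> List[Any]:
    """Fetch every shard with ray.get and concatenate the rows."""
    import ray

    rows: List[Any] = []
    for shard in ray.get(refs):
        rows.extend(shard)
    return rows
```

A call such as `fetch_all(put_shards(split_into_shards(embeddings, 4)))`, where `embeddings` is a placeholder for your own data, should return the original rows on whichever node runs it. If the shards are instead produced inside remote tasks or actors, the objects are created on the nodes where that code ran, which is the "already distributed" case mentioned above.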