How do I store the full graph data in the ObjectStore of each node?

YeahNew opened this issue 3 years ago · 5 comments

I have been following the progress of RayDP and trying to use it for data processing. I use spark = raydp.init_spark(...) to read a large graph dataset, convert it to a DataFrame, and then call create_ml_dataset_from_spark() to store it in the ObjectStore. My Ray cluster has three nodes, and I want each node's ObjectStore to hold one copy of the full graph data. So I executed the above method three times, but found that one machine stored 2 copies, one stored 1 copy, and the ObjectStore of the last machine was empty. After many attempts, it is still difficult to get the whole graph data into the ObjectStore of every node. Can this method (task) be scheduled to execute on a specific node? Looking forward to your reply~

YeahNew avatar Jul 28 '21 07:07 YeahNew

Hi, glad you tried raydp.

Ray's object store is shared among nodes. By calling our create_ml_dataset_from_spark, you create an MLDataset, which is partitioned. That means your data is probably distributed across the nodes. Why do you want the full graph data available on each node? What do you want to do in the next step? In Ray, you can always refer to data you put in the object store from a remote task, and if the node where the task is scheduled does not have the data yet, it will fetch the data first. So I guess there is no need to call create_ml_dataset_from_spark three times manually.
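
A minimal sketch of that behavior with Ray's core API (the names graph and train are illustrative, not part of raydp):

```python
import numpy as np
import ray

ray.init(address="auto")  # connect to the running cluster

graph = np.arange(10**6)    # stand-in for the graph data
graph_ref = ray.put(graph)  # one logical copy in the cluster-wide object store

@ray.remote
def train(graph):
    # Ray resolved the ObjectRef before invoking this task: if the local
    # object store did not already hold the data, it was fetched and cached.
    return graph.sum()

print(ray.get(train.remote(graph_ref)))
```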

kira-lin avatar Jul 28 '21 08:07 kira-lin

I hope to use RaySGD to start three deep learning training processes (Actors), one on each machine. Each process should read the full graph data directly from its local ObjectStore for sampling and training, avoiding the network overhead of fetching data from other nodes. @kira-lin

YeahNew avatar Jul 28 '21 10:07 YeahNew

Is it a GNN application? Like each node needs the full graph, but node/edge features can be partitioned? Anyway, I guess you can save the graph to parquet first, and copy it to the same path on each node. Then use our from_parquet to load it on each node in a remote task, and return the loaded data (as an ObjectRef). Finally, use this in the actors on each node.
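
A sketch of that recipe, assuming the parquet file has already been copied to the same local path on every node. The per-node pinning uses Ray's built-in node:<ip> custom resources; pyarrow stands in for raydp's from_parquet, and the path /data/graph.parquet is hypothetical:

```python
import pyarrow.parquet as pq
import ray

ray.init(address="auto")

@ray.remote(num_cpus=0)
def load_graph(path):
    # Returning the table places it in the object store of the node
    # where this task ran; keep the ObjectRef for that node's actors.
    return pq.read_table(path)

node_ips = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
graph_refs = {
    ip: load_graph.options(resources={f"node:{ip}": 0.001})
                  .remote("/data/graph.parquet")
    for ip in node_ips
}
# Pass graph_refs[ip] to the training actors scheduled on node ip.
```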

kira-lin avatar Jul 29 '21 01:07 kira-lin

Hi @YeahNew , I cannot find your reply, but I saw it in my mailbox. Have you solved the problem? I think you don't need to use MLDataset for the graph data, because you do not want it partitioned across nodes. You probably want to use it to load the edge/node features, different parts of which are used on different nodes. As for the graph data, if it is only used in the training actor, you don't even need to put it into the Ray object store.
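
A sketch of that split, with the graph loaded inside the actor itself (Trainer, graph_path, and feature_shard are illustrative names):

```python
import pyarrow.parquet as pq
import ray

@ray.remote
class Trainer:
    def __init__(self, graph_path, feature_shard):
        # The full graph lives in this actor's own memory; no object
        # store copy is needed since nothing else reads it.
        self.graph = pq.read_table(graph_path)
        # The partitioned features, e.g. one MLDataset shard per node.
        self.features = feature_shard

    def train_epoch(self):
        ...  # sample from self.graph, train on self.features
```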

kira-lin avatar Aug 02 '21 07:08 kira-lin

@kira-lin Yes, thank you for your reply, I have solved this problem. Sorry, I was busy and did not reply to you sooner.

Now I have another problem. The embeddings I get from model training are NumPy ndarrays, and I need to store them in the ObjectStore. But the create_ml_dataset_from_spark method only accepts data as a Spark sql.DataFrame. Could you provide a way to store data in other formats? Looking forward to your reply~
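
For what it's worth, an ndarray can go into the object store directly through Ray's core API rather than a raydp method; a minimal sketch (shapes and names are illustrative):

```python
import numpy as np
import ray

embeddings = np.random.rand(100_000, 128).astype(np.float32)  # stand-in for training output
emb_ref = ray.put(embeddings)  # stored in the local node's object store

# Any task or actor can take emb_ref; reads on the same node are
# zero-copy via shared memory for numpy arrays.
```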

YeahNew avatar Aug 14 '21 03:08 YeahNew

close as stale

kira-lin avatar Apr 14 '23 08:04 kira-lin