raydp
How do I store the full graph data in the ObjectStore of each node?
I have been following the progress of RayDP and trying to use it for data processing. I use spark = raydp.init_spark(...) to read a large graph dataset, convert it to a DataFrame, and then call create_ml_dataset_from_spark() to store it in the ObjectStore. My Ray cluster has three nodes, and I want each node's ObjectStore to hold one copy of the full graph data. So I executed the above method three times, and found that one machine stored 2 copies, one stored 1 copy, and the ObjectStore of the last machine was empty. After many attempts, it is still difficult to get the whole graph data into the ObjectStore of every node. Can this method (task) be scheduled to run on a specific node? Looking forward to your reply~
Hi, glad you tried raydp.
Ray's object store is shared among nodes. By calling our create_ml_dataset_from_spark, you create an MLDataset, which is partitioned. That means your data is probably distributed across nodes. Why do you want the full graph data available on each node? What do you want to do in the next step? In Ray, you can always refer to data you put in the object store from a remote task, and if the node where the task is scheduled does not have the data yet, it will fetch the data first. So I guess there is no need to call create_ml_dataset_from_spark three times manually.
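For reference, a minimal sketch of the flow described above, assuming the RayDP API of this era (raydp.init_spark plus create_ml_dataset_from_spark, whose exact module path and parameters may differ); the app name, resource sizes, input path, and shard count are illustrative:

```python
import ray
import raydp

ray.init(address="auto")  # connect to the existing three-node cluster

spark = raydp.init_spark(app_name="graph_demo",
                         num_executors=3,
                         executor_cores=2,
                         executor_memory="4GB")

df = spark.read.parquet("/data/graph.parquet")  # hypothetical input path

# One call is enough: the MLDataset is partitioned into shards that live in
# the cluster-wide shared object store, not pinned to a single node.
ds = raydp.spark.create_ml_dataset_from_spark(df, num_shards=3)

@ray.remote
def consume(obj):
    # Ray resolves `obj` on whichever node runs this task, fetching it into
    # the local object store first if it is not already there.
    return len(obj)

sample_ref = ray.put(df.limit(10).toPandas())
print(ray.get(consume.remote(sample_ref)))
```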
I hope to use RaySGD to start three deep-learning training processes (actors), one on each machine. Each process could then obtain the full graph data directly from the local ObjectStore for sampled training, avoiding the network overhead of fetching data from other nodes. @kira-lin
Is it a GNN application? Like each node needs the full graph, but node/edge features can be partitioned? Anyway, I guess you can save the graph to parquet first, and copy it to the same path on each node. Then use our from_parquet to load it on each node in a remote task, and return the loaded data (as an ObjectRef). Finally, use this in the actors on each node.
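A minimal sketch of this per-node loading pattern, assuming the graph parquet file has already been copied to the same path on every node. It uses plain pandas.read_parquet in place of raydp's from_parquet (whose exact signature the thread does not show), and pins one loader task per node via Ray's built-in "node:&lt;ip&gt;" resources; the path and resource fraction are illustrative:

```python
import ray
import pandas as pd

ray.init(address="auto")

@ray.remote(num_cpus=0)
def load_graph_local(path):
    # Runs on the node it is pinned to; the returned DataFrame is placed in
    # that node's object store, so co-located actors can read it locally.
    return pd.read_parquet(path)

graph_refs = {}
for node in ray.nodes():
    if not node["Alive"]:
        continue
    ip = node["NodeManagerAddress"]
    # Ray automatically exposes a "node:<ip>" resource on every node, which
    # lets us force one loader task onto each machine.
    pinned = load_graph_local.options(resources={f"node:{ip}": 0.01})
    graph_refs[ip] = pinned.remote("/data/graph.parquet")

# Hand the matching ObjectRef to the training actor started on the same node;
# ray.get inside that actor then reads from the local object store.
```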
Hi @YeahNew , I cannot find your reply, but I saw it in my mailbox. Have you solved the problem? I think you don't need to use MLDataset for the graph data, because you do not want it partitioned across nodes. You probably want to use it to load the edge/node features, of which a different part is used on each node. As for the graph data, if it is only used in the training actor, you don't even need to put it into the Ray object store.
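A minimal sketch of that alternative, assuming the full graph is only ever touched inside the training actor: load it in the actor's constructor and keep it on the actor's own heap, skipping the object store entirely (the path, feature shard, and training step are illustrative):

```python
import ray
import pandas as pd

@ray.remote(num_cpus=1)
class GraphTrainer:
    def __init__(self, graph_path, feature_shard_ref):
        # The full graph lives on this actor's own heap, not in the object
        # store, since no other process needs it.
        self.graph = pd.read_parquet(graph_path)
        # Node/edge features can still arrive partitioned (e.g. an MLDataset
        # shard), since each node uses a different part of them.
        self.features = ray.get(feature_shard_ref)

    def train_epoch(self):
        # Sample subgraphs from self.graph and look up self.features here.
        pass
```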
@kira-lin Yes, thank you for your reply; I have solved this problem. Sorry, I was busy and did not reply to you earlier.
Now I have another problem. The embeddings I get from model training are in NumPy ndarray format, and I need to store them in the ObjectStore. But the create_ml_dataset_from_spark method only supports data in Spark sql.DataFrame format. Could you provide a way to store data in other formats? Looking forward to your reply~
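One option that does not require raydp at all: plain ray.put accepts NumPy ndarrays directly, and Ray stores numpy buffers zero-copy, so readers on the same node share a single memory-mapped copy. A minimal sketch (the array shape and downstream task are illustrative):

```python
import numpy as np
import ray

ray.init(address="auto")

# Embeddings produced by training, as a plain ndarray.
embeddings = np.random.rand(10_000, 128).astype(np.float32)
ref = ray.put(embeddings)  # one copy lands in the shared object store

@ray.remote
def row_norms(arr):
    # On the producing node this reads the array straight out of shared
    # memory; on other nodes Ray fetches the object first.
    return np.linalg.norm(arr, axis=1)

norms = ray.get(row_norms.remote(ref))
```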
close as stale