[GraphBolt] Dataset dtypes
🔨Work Item
IMPORTANT:
- This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
- DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
Basically, any tensor holding indices that correspond to node ids should be stored with the int32 dtype.
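To illustrate why this matters, here is a quick back-of-the-envelope sketch (plain numpy; the edge and node counts are the papers100M figures quoted later in this thread) showing that int32 halves the indices footprint, and that int32 is safe whenever every node id fits in 31 bits:

```python
import numpy as np

# Counts taken from this thread: papers100M has ~3.23B edges, ~111M nodes.
num_edges = 3_228_124_712
num_nodes = 111_059_956

# Storing the indices tensor as int32 instead of int64 halves its size.
bytes_int64 = num_edges * np.dtype(np.int64).itemsize
bytes_int32 = num_edges * np.dtype(np.int32).itemsize
print(f"int64: {bytes_int64 / 2**30:.1f} GiB, int32: {bytes_int32 / 2**30:.1f} GiB")

# int32 is safe as long as every node id fits below 2**31 - 1.
assert num_nodes - 1 <= np.iinfo(np.int32).max
```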
- [ ] The ogbn-arxiv dataset uses int64 dtypes, while int32 would suffice for the `csc_indptr` and `indices` tensors and all the other node id tensors.
- [x] ogbn-papers100M has fewer than 2B nodes, so we should store its `indices` array with the int32 dtype. The `train_set`, `validation_set` and `test_set` should be stored as int32 as well.
- [x] The mag240M dataset needs:
  - `indices`, `node_type_offset`, `train_set`, `validation_set`, `test_set` cast to int32.
  - `type_per_edge` cast to uint8 or int8.
  - Nothing for `all_nodes_set`, because it automatically gets its dtype from `graph.indices`.
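A minimal sketch of the casts above, with an overflow guard. The real tensors are torch tensors inside `FusedCSCSamplingGraph`; numpy arrays and the `safe_downcast` helper are stand-ins used here for illustration:

```python
import numpy as np

# Hypothetical stand-ins for the mag240M tensors (the real ones are
# torch tensors inside FusedCSCSamplingGraph).
indices = np.array([102309412, 5808518, 6609397], dtype=np.int64)
type_per_edge = np.array([0, 3, 1], dtype=np.int64)

def safe_downcast(arr, dtype):
    """Cast to a narrower integer dtype, refusing if values would overflow."""
    info = np.iinfo(dtype)
    assert info.min <= arr.min() and arr.max() <= info.max, "values overflow"
    return arr.astype(dtype)

indices = safe_downcast(indices, np.int32)
type_per_edge = safe_downcast(type_per_edge, np.uint8)
```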
Saving features
When saving any feature tensors, we should make sure to use `gb.numpy_save_aligned` instead of `numpy.save`.
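For reference, a sketch of the save/load round trip with plain `numpy.save`; per the note above, `gb.numpy_save_aligned(path, array)` would be the GraphBolt replacement (assumed here to be a drop-in for `numpy.save`), so the on-disk data is aligned for efficient memory-mapped reads:

```python
import os
import tempfile
import numpy as np

# A small hypothetical feature tensor.
feature = np.arange(40, dtype=np.float32).reshape(10, 4)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "node_feat.npy")
    # Plain numpy.save; in GraphBolt, gb.numpy_save_aligned(path, feature)
    # would be used instead so the saved buffer is aligned for mmap reads.
    np.save(path, feature)
    # On-disk features are typically read back memory-mapped.
    loaded = np.array(np.load(path, mmap_mode="r"))
```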
- [x] ogbn-papers100M (Thanks to @Liu-rj)
- [ ] ogb-lsc-mag240M
@Rhett-Ying However, `all_nodes_set` can be a plain integer, so it looks like there is currently no way to specify its dtype. Casting to int32 is pretty crucial for performance though, so I am hoping that we can find an easy solution.
#7131 is related to this issue.
dtype conversion could be covered by https://github.com/dmlc/dgl/pull/7127 when instantiating dataset.
What about itemsets made of an int: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/itemset.py#L117
Because `sample_neighbors` type-checks the seed nodes tensor against the graph's `indices` tensor, anything coming from the itemset with a mismatched dtype raises an error at these lines: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/impl/fused_csc_sampling_graph.py#L644-L647
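A minimal sketch of the kind of check described above (the real check lives in `fused_csc_sampling_graph.py`; the function name and error message here are illustrative, not the actual DGL code):

```python
import numpy as np

def check_seed_dtype(nodes, indices):
    """Mimics the dtype check: seed nodes must match the graph indices dtype."""
    if nodes.dtype != indices.dtype:
        raise TypeError(
            f"seed nodes dtype {nodes.dtype} does not match "
            f"graph indices dtype {indices.dtype}"
        )

indices = np.array([0, 2, 1], dtype=np.int32)  # graph stored as int32
seeds_bad = np.array([0, 1], dtype=np.int64)   # e.g. train_set still int64

try:
    check_seed_dtype(seeds_bad, indices)
    raised = False
except TypeError:
    raised = True
```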
As for item sets that are generated at runtime, such as `all_nodes_set`, we need to figure out a way to set their dtype according to `graph.indices`.
I have changed the indices dtype to int32.
Now, we are getting an error because `train_set` etc. is not in int32. We will need #7127 to be merged to resolve the error.
OK, we can wait for that PR to see if there are further questions.
For papers100M, it says that the data is already preprocessed. That is why it does not perform the required type casts, causing an error.
```
The dataset is already preprocessed.
```
For products dataset, however, it performs the preprocessing step when we first download it.
```
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
```
Why is there such a discrepancy between these two datasets? I am guessing that we want to take the burden of preprocessing off the user by providing preprocessed versions of these larger datasets.
Also, after downloading papers, we get the following graph returned:
```
FusedCSCSamplingGraph(csc_indptr=tensor([ 0, 1, 9, ..., 3228124709, 3228124710,
                                          3228124712]),
                      indices=tensor([102309412, 5808518, 6609397, ..., 92367769, 59629722,
                                      95195371]),
                      total_num_nodes=111059956, num_edges=3228124712,
                      node_attributes={},
                      edge_attributes={},)
```
We get an error because the `indices` array is not using int32 while `train_set` is. So we need to modify the papers100M dataset and cast the graph's `indices` array to int32.
@caojy1998
Also, for the mag240M dataset, if we also provide the preprocessed version, can we perform the following type casts?
- `indices`, `node_type_offset`, `train_set`, `validation_set`, `test_set` into int32.
- `type_per_edge` into uint8.

We don't need to do anything for `all_nodes_set` because it automatically gets its dtype from `graph.indices`.
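To make the last point concrete, here is an illustrative sketch of how a runtime-generated "all nodes" item set can inherit its dtype from the graph, so no explicit cast is needed. The `Graph` class and `make_all_nodes_set` helper are hypothetical, not GraphBolt API:

```python
import numpy as np

class Graph:
    """Hypothetical stand-in for a graph holding an indices tensor."""
    def __init__(self, indices, total_num_nodes):
        self.indices = indices
        self.total_num_nodes = total_num_nodes

def make_all_nodes_set(graph):
    # Derive the dtype from graph.indices, as described above, so the
    # item set automatically matches the graph's node id dtype.
    return np.arange(graph.total_num_nodes, dtype=graph.indices.dtype)

g = Graph(np.array([0, 2, 1], dtype=np.int32), total_num_nodes=3)
all_nodes_set = make_all_nodes_set(g)
```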
@frozenbugs the mag240M dataset has yet to be converted to the right dtypes. Who do you think can tackle this task?
@Rhett-Ying When is the work for this issue planned? I see that you measure runtime on mag240M in some of the examples. Fixing this issue would lead to memory and runtime savings anywhere mag240M is used.
I'll do it this week.
Looks like ogbn-arxiv also has the wrong dtypes, updated the issue description.
@Liu-rj FYI
@pyynb when the mag240M dataset is downloaded, it extracts into datasets/opt/nvme/..../ogb-lsc-mag240m-seeds
directory instead of datasets/ogb-lsc-mag240m-seeds
. Could you check?
I'm on it