[GraphBolt] Dataset dtypes
🔨Work Item
IMPORTANT:
- This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
- DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
Basically, any tensor holding indices that correspond to node ids should be stored with the int32 dtype.
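To illustrate why this matters, here is a quick back-of-the-envelope sketch (plain numpy; the edge and node counts are the papers100M figures quoted later in this thread) showing that int32 halves the indices footprint, and that int32 is safe whenever every node id fits in 31 bits:

```python
import numpy as np

# Counts taken from this thread: papers100M has ~3.23B edges, ~111M nodes.
num_edges = 3_228_124_712
num_nodes = 111_059_956

# Storing the indices tensor as int32 instead of int64 halves its size.
bytes_int64 = num_edges * np.dtype(np.int64).itemsize
bytes_int32 = num_edges * np.dtype(np.int32).itemsize
print(f"int64: {bytes_int64 / 2**30:.1f} GiB, int32: {bytes_int32 / 2**30:.1f} GiB")

# int32 is safe as long as every node id fits below 2**31 - 1.
assert num_nodes - 1 <= np.iinfo(np.int32).max
```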
- [ ] The ogbn-arxiv dataset uses int64 dtypes, while int32 would suffice for the `csc_indptr` and `indices` tensors and all the other node id tensors.
- [x] ogbn-papers100M has fewer than 2B nodes, so we should store its `indices` array with the int32 dtype. The `train_set`, `validation_set` and `test_set` should be stored as int32 as well.
- [x] The mag240M dataset needs:
  - `indices`, `node_type_offset`, `train_set`, `validation_set`, `test_set` cast to int32.
  - `type_per_edge` cast to uint8 or int8.
  - Nothing for `all_nodes_set`, because it automatically gets its dtype from `graph.indices`.
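A minimal sketch of the casts above, with an overflow guard. The real tensors are torch tensors inside `FusedCSCSamplingGraph`; numpy arrays and the `safe_downcast` helper are stand-ins used here for illustration:

```python
import numpy as np

# Hypothetical stand-ins for the mag240M tensors (the real ones are
# torch tensors inside FusedCSCSamplingGraph).
indices = np.array([102309412, 5808518, 6609397], dtype=np.int64)
type_per_edge = np.array([0, 3, 1], dtype=np.int64)

def safe_downcast(arr, dtype):
    """Cast to a narrower integer dtype, refusing if values would overflow."""
    info = np.iinfo(dtype)
    assert info.min <= arr.min() and arr.max() <= info.max, "values overflow"
    return arr.astype(dtype)

indices = safe_downcast(indices, np.int32)
type_per_edge = safe_downcast(type_per_edge, np.uint8)
```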
Saving features
When saving any feature tensors, we should make sure to use `gb.numpy_save_aligned` instead of `numpy.save`.
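For reference, a sketch of the save/load round trip with plain `numpy.save`; per the note above, `gb.numpy_save_aligned(path, array)` would be the GraphBolt replacement (assumed here to be a drop-in for `numpy.save`), so the on-disk data is aligned for efficient memory-mapped reads:

```python
import os
import tempfile
import numpy as np

# A small hypothetical feature tensor.
feature = np.arange(40, dtype=np.float32).reshape(10, 4)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "node_feat.npy")
    # Plain numpy.save; in GraphBolt, gb.numpy_save_aligned(path, feature)
    # would be used instead so the saved buffer is aligned for mmap reads.
    np.save(path, feature)
    # On-disk features are typically read back memory-mapped.
    loaded = np.array(np.load(path, mmap_mode="r"))
```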
- [x] ogbn-papers100M (Thanks to @Liu-rj)
- [ ] ogb-lsc-mag240M
@Rhett-Ying However, `all_nodes_set` can be a plain integer, so it looks like there is currently no way to specify its dtype. Casting to int32 is pretty crucial for performance though, so I am hoping that we can find an easy solution.
#7131 is related to this issue.
dtype conversion could be covered by https://github.com/dmlc/dgl/pull/7127 when instantiating dataset.
What about itemsets made of an int: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/itemset.py#L117
Because `sample_neighbors` type-checks the seed nodes tensor against the graph's `indices` tensor, anything coming from the itemset with a mismatched dtype raises an error at these lines: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/impl/fused_csc_sampling_graph.py#L644-L647
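A minimal sketch of the kind of check described above (the real check lives in `fused_csc_sampling_graph.py`; the function name and error message here are illustrative, not the actual DGL code):

```python
import numpy as np

def check_seed_dtype(nodes, indices):
    """Mimics the dtype check: seed nodes must match the graph indices dtype."""
    if nodes.dtype != indices.dtype:
        raise TypeError(
            f"seed nodes dtype {nodes.dtype} does not match "
            f"graph indices dtype {indices.dtype}"
        )

indices = np.array([0, 2, 1], dtype=np.int32)  # graph stored as int32
seeds_bad = np.array([0, 1], dtype=np.int64)   # e.g. train_set still int64

try:
    check_seed_dtype(seeds_bad, indices)
    raised = False
except TypeError:
    raised = True
```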
As for item sets that are generated at runtime, such as `all_nodes_set`, we need to figure out a way to set their dtype according to `graph.indices`.
I have changed the indices dtype to int32.
Now, we are getting an error because `train_set` etc. is not in int32. We will need #7127 to be merged to resolve the error.
OK, we can wait for that PR to see if there are further questions.
For papers100M, it says that the data is already preprocessed. That is why it does not perform the required type casts, causing an error.
```
The dataset is already preprocessed.
```
For products dataset, however, it performs the preprocessing step when we first download it.
```
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
```
Why is there such a discrepancy between these two datasets? I am guessing that we want to take the burden of preprocessing off the user by providing preprocessed versions of these larger datasets.
Also, after downloading papers, we get the following graph returned:
```
FusedCSCSamplingGraph(csc_indptr=tensor([ 0, 1, 9, ..., 3228124709, 3228124710,
                                          3228124712]),
                      indices=tensor([102309412, 5808518, 6609397, ..., 92367769, 59629722,
                                      95195371]),
                      total_num_nodes=111059956, num_edges=3228124712,
                      node_attributes={},
                      edge_attributes={},)
```
We get an error because the `indices` array is not using int32 while `train_set` is. So we need to modify the papers100M dataset and cast the graph's `indices` array to int32.
@caojy1998
Also, for the mag240M dataset, if we also provide the preprocessed version, can we perform the following type casts?
- `indices`, `node_type_offset`, `train_set`, `validation_set`, `test_set` into int32.
- `type_per_edge` into uint8.

We don't need to do anything for `all_nodes_set` because it automatically gets its dtype from `graph.indices`.
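To make the last point concrete, here is an illustrative sketch of how a runtime-generated "all nodes" item set can inherit its dtype from the graph, so no explicit cast is needed. The `Graph` class and `make_all_nodes_set` helper are hypothetical, not GraphBolt API:

```python
import numpy as np

class Graph:
    """Hypothetical stand-in for a graph holding an indices tensor."""
    def __init__(self, indices, total_num_nodes):
        self.indices = indices
        self.total_num_nodes = total_num_nodes

def make_all_nodes_set(graph):
    # Derive the dtype from graph.indices, as described above, so the
    # item set automatically matches the graph's node id dtype.
    return np.arange(graph.total_num_nodes, dtype=graph.indices.dtype)

g = Graph(np.array([0, 2, 1], dtype=np.int32), total_num_nodes=3)
all_nodes_set = make_all_nodes_set(g)
```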
@frozenbugs the mag240M dataset has yet to be converted to the right dtypes. Who do you think can tackle this task?
@Rhett-Ying When is the work for this issue planned? I see that you measure runtime on mag240M in some of the examples. Fixing this issue would lead to memory and runtime savings anywhere mag240M is used.
I'll do it this week.
Looks like ogbn-arxiv also has the wrong dtypes, updated the issue description.
@Liu-rj FYI
@pyynb when the mag240M dataset is downloaded, it extracts into datasets/opt/nvme/..../ogb-lsc-mag240m-seeds
directory instead of datasets/ogb-lsc-mag240m-seeds
. Could you check?
I'm on it