MLDatasets.jl

load torch tensors in OGBDatasets

Open CarloLucibello opened this issue 2 years ago • 7 comments

Some of the features of the OGBDataset are downloaded as torch tensors stored in the ".pt" format. They are currently ignored, but we could load them using Pickle.jl (e.g. see this comment).

CarloLucibello, Mar 27 '22 15:03

Can I work on this?

Dsantra92, Apr 18 '22 15:04

Sure. I don't remember which specific dataset this was needed for, though.

CarloLucibello, Apr 19 '22 06:04

This problem can be seen in OGBDataset("ogbl-collab")

CarloLucibello, May 08 '22 09:05

I've been inactive due to university exams; I will start working on it today.

Dsantra92, May 15 '22 06:05

Some of these problems, including loading the ".pt" format with Pickle.jl, have already been solved here after discussion with @chengchingwen: https://github.com/yuehhua/GraphMLDatasets.jl/blob/65d6a2bb02d31569a64b47004a0c4b192739a066/src/preprocess.jl#L391 Hope this code helps.

yuehhua, May 21 '22 02:05

Split tensors appear for edge-level tasks in OGB datasets. Dataset loading for LinkPropPred datasets differs from GraphPropPred or NodePropPred. We might need to change the OGBDataset API. Here are some approaches:

  1. Pass the split as an argument
data = OGBDataset(name, split; dir)

But this has one obvious problem: loading any single split, e.g. train, would still involve computing the other two splits (val and test), because the splits are stored intertwined.

  2. Return train, test and validation splits for each dataset
train_data, test_data, valid_data = OGBDataset(name; dir)

This can be ambiguous for datasets that have no splits, and it does not match the other dataset APIs.

  3. Compute the split from the dataset
data = OGBDataset(name; dir)
train_split = split(data, :train) # this may be a weird way to do it
# maybe something like
train_split = data[:train]

The representation for link tasks in OGBDataset will differ from that of node or graph tasks.
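To make approach 3 concrete, here is a minimal sketch of what `data[:train]` could look like. The `SplitLinkData` type, its fields and the toy edge list are all hypothetical and only illustrate the indexing idea; none of this is the existing MLDatasets.jl API.

```julia
# Hypothetical container, not part of MLDatasets.jl.
struct SplitLinkData
    edges::Matrix{Int}                 # 2 × E matrix, one column per edge
    splits::Dict{Symbol,Vector{Int}}   # split name => column indices into `edges`
end

# Approach 3: index the already-loaded dataset by split name.
function Base.getindex(d::SplitLinkData, s::Symbol)
    if !haskey(d.splits, s)
        available = join(keys(d.splits), ", ")
        throw(ArgumentError("unknown split $s; available: $available"))
    end
    return d.edges[:, d.splits[s]]
end

# Toy data standing in for a link-prediction dataset (assumed layout).
edges = [1 2 3 4 5;
         2 3 4 5 1]
data = SplitLinkData(edges, Dict(:train => [1, 2, 3], :val => [4], :test => [5]))

train_edges = data[:train]   # edges whose column index is in the train split
```

With this shape, the dataset is loaded once and each split is just a view-like selection, which sidesteps the problem in approach 1 of recomputing the other splits on every load.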

Dsantra92, Jun 13 '22 17:06

Also, the API for splits should be consistent across data sources. E.g. Cora and OGBDataset currently expose their training masks through different APIs.
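One possible way to get that consistency is a single accessor that normalizes whatever a source stores. The sketch below assumes, purely for illustration, that one source keeps a boolean mask and another keeps a vector of indices; the name `split_indices` is hypothetical.

```julia
# Hypothetical helper: always return the split as a vector of indices.

# Source that stores a boolean mask over nodes (assumed representation).
split_indices(mask::AbstractVector{Bool}) = findall(mask)

# Source that already stores integer indices (assumed representation).
split_indices(idx::AbstractVector{<:Integer}) = collect(idx)

mask_style = [true, false, true, false]   # mask-based source
index_style = [1, 3]                      # index-based source

same = split_indices(mask_style) == split_indices(index_style)
```

The `Bool` method is the more specific of the two, so masks are never mistaken for index vectors even though `Bool <: Integer`.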

Dsantra92, Jul 11 '22 20:07