PEER_Benchmark

get datasets?

Open tyang816 opened this issue 2 years ago • 1 comments

Sorry to bother you. How can I obtain the processed datasets? Some of the datasets have been updated.

tyang816, Sep 15 '23 02:09

Hi, I tried the same thing in order to examine the datasets in detail. I cloned the torchdrug repo (https://github.com/DeepGraphLearning/torchdrug.git), created a conda environment on my machine as described in its README, and downloaded the datasets one by one with this script:

import torchdrug
from torchdrug import datasets, data
import os

print("TorchDrug version:", torchdrug.__version__)

dataset_list = [cls for cls in dir(datasets) if not cls.startswith("_")]
print("\nAvailable TorchDrug datasets:")
for ds in dataset_list:
    cls = getattr(datasets, ds)
    if isinstance(cls, type):
        print("-", ds)

base_path = os.path.expanduser("~/torchdrug_datasets")
os.makedirs(base_path, exist_ok=True)
dataset_name = "SubcellularLocalization"
dataset_path = os.path.join(base_path, dataset_name)

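# Keep a reference to the original method so it can be restored later
original_from_sequence = data.Protein.from_sequence

# The 20 standard amino acids
valid_aas = set("ACDEFGHIKLMNPQRSTVWY")

def safe_from_sequence(seq, **kwargs):
    # Build the protein only if every residue is a standard amino acid
    if all(residue in valid_aas for residue in seq):
        return original_from_sequence(seq, **kwargs)
    # Skip invalid sequence by returning None
    return None

# Temporarily patch Protein.from_sequence while the dataset is loaded
data.Protein.from_sequence = safe_from_sequence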

print(f"Downloading {dataset_name} to {dataset_path} ...")
try:
    dataset = datasets.SubcellularLocalization(path=dataset_path, transform=None)

    valid_samples = []
    for i in range(len(dataset)):
        sample = dataset[i]
        # If "graph" is None (because sequence was invalid), skip it
        if sample["graph"] is not None:
            valid_samples.append(sample)
finally:
    # Always restore the original method, even if loading fails
    data.Protein.from_sequence = original_from_sequence

But I am not sure if this is what you are looking for. Also, be careful: the datasets are quite big, so they take a while to download.
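
If it helps, you can also check beforehand which sequences would be dropped by the standard-amino-acid filter, and why. This is only a small sketch in plain Python that does not touch torchdrug; `invalid_residues` and `split_sequences` are names I made up for illustration, and the toy sequences are invented, not taken from any dataset.

```python
from collections import Counter

# The 20 standard amino acids, the same alphabet used in the script above.
STANDARD_AAS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(seq):
    """Count the characters in seq that are not standard amino acids."""
    return Counter(ch for ch in seq if ch not in STANDARD_AAS)


def split_sequences(sequences):
    """Split sequences into (valid, skipped).

    skipped holds (index, sequence, {residue: count}) for each rejected sequence.
    """
    valid, skipped = [], []
    for i, seq in enumerate(sequences):
        bad = invalid_residues(seq)
        if bad:
            skipped.append((i, seq, dict(bad)))
        else:
            valid.append(seq)
    return valid, skipped


if __name__ == "__main__":
    # Invented toy sequences, for illustration only.
    toy = ["MKTAYIAK", "MKXAYUAK", "ACDB"]
    valid, skipped = split_sequences(toy)
    print("valid:", valid)
    for i, seq, bad in skipped:
        print(f"skipped #{i}: {seq} has non-standard residues {bad}")
```

Running this on the raw sequences first should tell you how many samples the filter would discard, which may matter if you want to compare against numbers reported elsewhere.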

beyzoskaya, Sep 14 '25 11:09