PEER_Benchmark
get datasets?
Sorry to bother, how can I obtain the processed dataset? Some datasets have been updated.
Hi, I tried the same thing in order to examine the datasets in detail. I cloned the torchdrug repo and created a conda env on my machine as described in the torchdrug README (https://github.com/DeepGraphLearning/torchdrug.git), then downloaded the datasets one by one with this:
```python
import os

import torchdrug
from torchdrug import datasets, data

print("TorchDrug version:", torchdrug.__version__)

# List the dataset classes exposed by torchdrug.datasets.
dataset_list = [cls for cls in dir(datasets) if not cls.startswith("_")]
print("\nAvailable TorchDrug datasets:")
for ds in dataset_list:
    cls = getattr(datasets, ds)
    if isinstance(cls, type):
        print("-", ds)

base_path = os.path.expanduser("~/torchdrug_datasets")
os.makedirs(base_path, exist_ok=True)
dataset_name = "SubcellularLocalization"
dataset_path = os.path.join(base_path, dataset_name)

# Temporarily patch Protein.from_sequence so that sequences containing
# residues outside the 20 standard amino acids are turned into None.
original_from_sequence = data.Protein.from_sequence

def safe_from_sequence(seq, **kwargs):
    valid_aas = set("ACDEFGHIKLMNPQRSTVWY")
    if all(residue in valid_aas for residue in seq):
        return original_from_sequence(seq, **kwargs)
    else:
        # Skip invalid sequence by returning None
        return None

data.Protein.from_sequence = safe_from_sequence

print(f"Downloading {dataset_name} to {dataset_path} ...")
dataset = datasets.SubcellularLocalization(path=dataset_path, transform=None)

valid_samples = []
for i in range(len(dataset)):
    sample = dataset[i]
    # If "graph" is None (because sequence was invalid), skip it
    if sample["graph"] is not None:
        valid_samples.append(sample)

# Restore the original method once loading is done.
data.Protein.from_sequence = original_from_sequence
```
But I am not sure if this is what you want. Also be careful: the datasets are quite big, so they take a while to download.
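If you only need the residue check itself, here is a minimal sketch that does not need torchdrug and that you can run on raw sequences before building any proteins. The names `STANDARD_AAS`, `is_standard_sequence` and `split_valid` are hypothetical helpers I made up for illustration, not part of TorchDrug, and the example sequences are toy strings rather than real dataset entries.

```python
# Hypothetical helpers, not part of TorchDrug.
# The alphabet matches the 20 standard residues used in the snippet above.
STANDARD_AAS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_standard_sequence(seq):
    """Return True if every residue in seq is one of the 20 standard ones."""
    return set(seq) <= STANDARD_AAS


def split_valid(sequences):
    """Split sequences into (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for seq in sequences:
        if is_standard_sequence(seq):
            valid.append(seq)
        else:
            invalid.append(seq)
    return valid, invalid


# Toy input: the second and third strings contain non-standard letters.
valid, invalid = split_valid(["MKTAYIAK", "MKXB", "ACDU"])
print(len(valid), "valid;", len(invalid), "skipped")
```

Keeping the check as a plain function makes it easy to count how many sequences would be dropped before you start a long download or patch anything in the library.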