pytorch_geometric
Unable to process movie_lens dataset with local directory transformers model
🐛 Describe the bug
This is my code:
from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2')
It correctly used the local directory for sentence_transformers to process the raw text, but then it raised this error:
Traceback (most recent call last):
File "try1/try4.py", line 3, in <module>
dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2')
File "env_path/lib/python3.8/site-packages/torch_geometric/datasets/movie_lens.py", line 43, in __init__
super().__init__(root, transform, pre_transform)
File "env_path/lib/python3.8/site-packages/torch_geometric/data/in_memory_dataset.py", line 50, in __init__
super().__init__(root, transform, pre_transform, pre_filter)
File "env_path/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 87, in __init__
self._process()
File "env_path/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 170, in _process
self.process()
File "env_path/lib/python3.8/site-packages/torch_geometric/datasets/movie_lens.py", line 96, in process
torch.save(self.collate([data]), self.processed_paths[0])
File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 377, in save
with _open_file_like(f, 'wb') as opened_file:
File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
return _open_file(name_or_buffer, mode)
File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/data/pyg_data/MovieLens/processed/data_/data/pretrained_model/all-MiniLM-L6-v2.pt'
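The failing path in the last line can be reproduced without PyG. This is a minimal sketch, assuming the processed file name is built as f'data_{model_name}.pt' (which is what the error message suggests) and using a temporary directory in place of the real root:

```python
import os
import tempfile

model_name = '/data/pretrained_model/all-MiniLM-L6-v2'

# Assumption: mirrors how the processed file name is formed from model_name.
file_name = f'data_{model_name}.pt'

with tempfile.TemporaryDirectory() as root:
    # Stand-in for <root>/processed, which exists before saving.
    processed_dir = os.path.join(root, 'processed')
    os.makedirs(processed_dir)

    # The slashes inside file_name become extra directory levels.
    path = os.path.join(processed_dir, file_name)
    parent_exists = os.path.isdir(os.path.dirname(path))

    try:
        # Opening for writing fails because the parent directories are missing.
        open(path, 'wb').close()
        error = None
    except FileNotFoundError as exc:
        error = exc

print(path)
print(parent_exists, type(error).__name__)
```

The file name itself is only the last component; everything before it is treated as nested directories under processed/ that were never created.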
It is clear that this is because of the / in model_name: the slashes are treated as directory separators, so torch.save looks for nested directories under processed/ that do not exist. So I made these changes:
In the input parameters of __init__(), append: processed_file_name: Optional[str] = "all-MiniLM-L6-v2"
In the body of __init__(), append: self.processed_file_name = processed_file_name
Change return f'data_{self.model_name}.pt' to return f'data_{self.processed_file_name}.pt'
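The change described above can be sketched on a small stand-in class. MovieLensSketch is hypothetical and only carries the naming logic, not the real dataset code; the property name processed_file_names is assumed from the usual PyG Dataset convention:

```python
from typing import Optional


class MovieLensSketch:
    """Hypothetical stand-in for the dataset class; only naming is shown."""

    def __init__(self, root: str, model_name: str,
                 processed_file_name: Optional[str] = 'all-MiniLM-L6-v2'):
        self.root = root
        # May be a model identifier or a local directory path.
        self.model_name = model_name
        # Used only to name the processed file, so it must not contain '/'.
        self.processed_file_name = processed_file_name

    @property
    def processed_file_names(self) -> str:
        # Previously built from self.model_name, which may contain slashes.
        return f'data_{self.processed_file_name}.pt'


ds = MovieLensSketch(
    root='/data/pyg_data/MovieLens',
    model_name='/data/pretrained_model/all-MiniLM-L6-v2',
    processed_file_name='all-MiniLM-L6-v2',
)
name = ds.processed_file_names
print(name)
```

An alternative that avoids the extra argument would be to derive the name with os.path.basename(model_name); that is only a suggestion and not what the change above does.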
And the original code changes to:
from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2',processed_file_name='all-MiniLM-L6-v2')
Now it works.
Environment
PyG version: 2.1.0.dev20220815
PyTorch version: 1.11.0
OS: Linux
Python version: 3.8.13
CUDA/cuDNN version: cuda10.2 cudnn7.6.5
How you installed PyTorch and PyG (conda, pip, source):
PyTorch: conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch
PyG:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install pyg-nightly
Any other relevant information (e.g., version of torch-scatter): torch-scatter 2.0.9, torch-sparse 0.6.14
Thanks for reporting. Do you want to send a pull request to fix?
OK. I have sent this pull request: https://github.com/pyg-team/pytorch_geometric/pull/5503