Unable to process movie_lens dataset with local directory transformers model

Open PolarisRisingWar opened this issue 1 year ago • 2 comments

🐛 Describe the bug

This is my code:

from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2')

It correctly used the local repository for sentence_transformers to process the raw text, but then raised this error:

Traceback (most recent call last):
  File "try1/try4.py", line 3, in <module>
    dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2')
  File "env_path/lib/python3.8/site-packages/torch_geometric/datasets/movie_lens.py", line 43, in __init__
    super().__init__(root, transform, pre_transform)
  File "env_path/lib/python3.8/site-packages/torch_geometric/data/in_memory_dataset.py", line 50, in __init__
    super().__init__(root, transform, pre_transform, pre_filter)
  File "env_path/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 87, in __init__
    self._process()
  File "env_path/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 170, in _process
    self.process()
  File "env_path/lib/python3.8/site-packages/torch_geometric/datasets/movie_lens.py", line 96, in process
    torch.save(self.collate([data]), self.processed_paths[0])
  File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 377, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "env_path/lib/python3.8/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/data/pyg_data/MovieLens/processed/data_/data/pretrained_model/all-MiniLM-L6-v2.pt'

This is clearly because of the / characters in model_name, which end up in the processed file name, so I made these changes in the MovieLens class:

In the input parameters of __init__(), append: processed_file_name: Optional[str] = "all-MiniLM-L6-v2"
In __init__(), append: self.processed_file_name=processed_file_name
Change return f'data_{self.model_name}.pt' to return f'data_{self.processed_file_name}.pt'

The original code then changes to:

from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/data/pyg_data/MovieLens',model_name='/data/pretrained_model/all-MiniLM-L6-v2',processed_file_name='all-MiniLM-L6-v2')

Now it works.
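
A more general alternative to passing processed_file_name by hand would be to derive a file-system-safe name from model_name. The following is only a minimal sketch: processed_name_for is a hypothetical helper, not part of PyG and not necessarily what the eventual fix does, and it assumes the last path component is enough to identify the model.

```python
import os


def processed_name_for(model_name: str) -> str:
    """Hypothetical helper: build a processed file name without path separators."""
    # normpath drops any trailing separator, so '/path/to/model/' still
    # yields 'model' as its last component.
    base = os.path.basename(os.path.normpath(model_name))
    return f'data_{base}.pt'


# A local model directory and a plain model name map to the same file name.
print(processed_name_for('/data/pretrained_model/all-MiniLM-L6-v2'))
print(processed_name_for('all-MiniLM-L6-v2'))
```

Keeping only the last component means two models stored in directories with the same name would share one processed file. Replacing the separators instead, for example with model_name.replace('/', '_'), would avoid that at the cost of longer file names.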

Environment

PyG version: 2.1.0.dev20220815
PyTorch version: 1.11.0
OS: Linux
Python version: 3.8.13
CUDA/cuDNN version: cuda10.2 cudnn7.6.5
How you installed PyTorch and PyG (conda, pip, source):
PyTorch: conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch
PyG:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install pyg-nightly
Any other relevant information (e.g., version of torch-scatter): torch-scatter 2.0.9, torch-sparse 0.6.14

PolarisRisingWar, Sep 22 '22 10:09

Thanks for reporting. Do you want to send a pull request to fix?

rusty1s, Sep 22 '22 11:09

OK. I've sent this pull request: https://github.com/pyg-team/pytorch_geometric/pull/5503

PolarisRisingWar, Sep 22 '22 12:09