
Error when testing my model using config files

Open MasterLucas opened this issue 6 months ago • 1 comment

Dear Schnetpack developers,

I'm encountering the following error when testing my model with a config.yaml file:

Traceback (most recent call last):
  File "/projappl/bandeira/schnet_env/lib64/python3.12/site-packages/schnetpack/cli.py", line 179, in train
    trainer.test(model=task, datamodule=datamodule, ckpt_path="best")
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 775, in test
    return call._call_and_handle_interrupt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 817, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1012, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run_stage
    return self._evaluation_loop.run()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 138, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/torch/utils/data/dataloader.py", line 1455, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
    data.reraise()
  File "/usr/local/lib64/python3.12/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib64/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/projappl/bandeira/schnet_env/lib64/python3.12/site-packages/schnetpack/data/atoms.py", line 270, in __getitem__
    props = self._get_properties(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/projappl/bandeira/schnet_env/lib64/python3.12/site-packages/schnetpack/data/atoms.py", line 339, in _get_properties
    row = conn.get(idx + 1)
          ^^^^^^^^^^^^^^^^^
  File "/projappl/bandeira/schnet_env/lib64/python3.12/site-packages/ase/db/core.py", line 531, in get
    raise KeyError('no match')
KeyError: 'no match'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I am using ase==3.25.0, schnetpack==2.1.1, and torch==2.6.0+cu124. Could you help me understand what is going on? As a test, I changed row = conn.get(idx + 1) to row = conn.get(idx), and the model test then worked. Could that cause problems later on or lead to a mismatch in the dataset indices?
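
For reference, here is a minimal sketch of how ase.db row ids behave (the database path below is just a placeholder, not my actual dataset):

# ase.db row ids are 1-based; SchNetPack maps the 0-based dataset index idx
# to conn.get(idx + 1), which raises KeyError('no match') if that id does
# not exist in the database.
from ase.db import connect

conn = connect("my_data.db")   # placeholder path
n_rows = conn.count()          # total number of rows in the database

first_row = conn.get(1)        # valid: row ids start at 1
last_row = conn.get(n_rows)    # valid as long as no rows were deleted (ids stay contiguous)
# conn.get(0) or conn.get(n_rows + 1) raise KeyError('no match'),
# which is exactly the error in the traceback above.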

Yours faithfully,

MasterLucas · Jun 03 '25 09:06

Hi @MasterLucas,

It is hard to tell what is going on based on this information alone, except that you seem to be loading an item from the database that does not exist. How did you generate the splits?

And could you please check the following:

  • the minimum and maximum index of the train, val, and test splits
  • the length of the database (number of rows)

Note that the SchNetPack split files contain 0-based indices, while the ASE database uses 1-based row ids (hence the idx + 1 in _get_properties). A quick check could look like the sketch below.
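
Something along these lines could help with the check (the split-file path and key names are assumptions based on how SchNetPack typically saves its splits; adjust them to your setup):

import numpy as np
from ase.db import connect

splits = np.load("split.npz")    # assumed path to the SchNetPack split file
conn = connect("my_data.db")     # assumed path to your ASE database
n_rows = conn.count()            # number of rows in the database

for name in ("train_idx", "val_idx", "test_idx"):   # assumed key names
    idx = splits[name]
    print(f"{name}: min={idx.min()}, max={idx.max()}, n={len(idx)}")

print("database length:", n_rows)
# Since the split indices are 0-based and conn.get(idx + 1) is 1-based,
# every maximum index must be strictly smaller than n_rows; otherwise the
# lookup fails with KeyError('no match').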

stefaanhessmann · Jun 04 '25 08:06