pytorch_geometric icon indicating copy to clipboard operation
pytorch_geometric copied to clipboard

Check `download`, `process` for class inheritance

Open pgmikhael opened this issue 3 years ago • 7 comments

Noticed the problem described in issue #4567 (closed) and saw the proposed fix. The fix (https://github.com/pyg-team/pytorch_geometric/pull/4586) doesn't check whether download or process exists, which the previous implementation does. Perhaps this change is closer to the intended behavior.

pgmikhael avatar Sep 16 '22 05:09 pgmikhael

Thanks for the PR. I assume that in this case, download is always called since it is present as a function at the base class. The intended behavior is that we only call it in case it is explicitly overridden by the user. I am not sure if this is covered in your PR. WDYT?

rusty1s avatar Sep 16 '22 05:09 rusty1s

Hi! Yep, download in this case would be called but, if the same raw files are used (same self.raw_file_names), then it should skip re-downloading? Beyond that, the fix in master also seems to always return True based on the name of the parent class as long as the name is not "Dataset," and doesn't seem to depend on the child class overwriting download. This is why the new fix confused me a bit. Thanks for taking a look!

pgmikhael avatar Sep 16 '22 13:09 pgmikhael

Hey @pgmikhael, thanks for looking at this.

Beyond that, the fix in master also seems to always return True based on the name of the parent class as long as the name is not "Dataset," and doesn't seem to depend on the child class overwriting

Yeah, this is the intended behavior in the fix.

Maybe this would be more clear if we just removed this fix altogether and changed the implementation in the base class of download from

    def download(self):
        r"""Downloads the dataset to the :obj:`self.raw_dir` folder."""
        raise NotImplementedError

to

    def download(self):
        r"""Downloads the dataset to the :obj:`self.raw_dir` folder."""
        pass

WDTY?

Padarn avatar Sep 17 '22 01:09 Padarn

Hey :) - the way I'm thinking about it is there are three desired behaviors:

  1. When creating a new geometric dataset, an error should be raised that download/process are not implemented as the data associated with the dataset won't be organized at all. So the NotImplementedError actually may be desired.
  2. When inheriting from a custom dataset, the user would expect that not overwriting a method like process means that the parent method will be used, as is the typical inheritance behavior. For instance, say my child class utilizes different files from its parent but otherwise wants to create the same graph structure.
  3. When we overwrite process, it should be executed once.

Since _download and _process already look at whether raw_paths and processed_paths exist, then always calling download/process may actually work (?)

pgmikhael avatar Sep 17 '22 04:09 pgmikhael

Hmm I don't disagree with the requirements, but I'm not super clear on which of them is violated now?

Padarn avatar Sep 17 '22 07:09 Padarn

Oh I realize now why the != 'Dataset is used instead of just calling them -- this may very well not need this PR I think. Apologies!

pgmikhael avatar Sep 17 '22 07:09 pgmikhael

No worries :-) Maybe you could update the test cases if you think that would make it more clear?

Padarn avatar Sep 17 '22 13:09 Padarn

This is now fixed.

rusty1s avatar Mar 20 '23 07:03 rusty1s