Check `download`, `process` for class inheritance
Noticed the problem described in issue #4567 (closed) and saw the proposed fix. The fix (https://github.com/pyg-team/pytorch_geometric/pull/4586) doesn't check whether download or process exists, which the previous implementation does. Perhaps this change is closer to the intended behavior.
Thanks for the PR. I assume that in this case, download is always called since it is present as a function at the base class. The intended behavior is that we only call it in case it is explicitly overridden by the user. I am not sure if this is covered in your PR. WDYT?
Hi! Yep, download in this case would be called but, if the same raw files are used (same self.raw_file_names), then it should skip re-downloading? Beyond that, the fix in master also seems to always return True based on the name of the parent class as long as the name is not "Dataset," and doesn't seem to depend on the child class overwriting download. This is why the new fix confused me a bit. Thanks for taking a look!
Hey @pgmikhael, thanks for looking at this.
Beyond that, the fix in master also seems to always return True based on the name of the parent class as long as the name is not "Dataset," and doesn't seem to depend on the child class overwriting
Yeah, this is the intended behavior in the fix.
Maybe this would be more clear if we just removed this fix altogether and changed the implementation in the base class of download from
def download(self):
r"""Downloads the dataset to the :obj:`self.raw_dir` folder."""
raise NotImplementedError
to
def download(self):
r"""Downloads the dataset to the :obj:`self.raw_dir` folder."""
pass
WDTY?
Hey :) - the way I'm thinking about it is there are three desired behaviors:
- When creating a new geometric dataset, an error should be raised that
download/processare not implemented as the data associated with the dataset won't be organized at all. So theNotImplementedErroractually may be desired. - When inheriting from a custom dataset, the user would expect that not overwriting a method like
processmeans that the parent method will be used, as is the typical inheritance behavior. For instance, say my child class utilizes different files from its parent but otherwise wants to create the same graph structure. - When we overwrite
process, it should be executed once.
Since _download and _process already look at whether raw_paths and processed_paths exist, then always calling download/process may actually work (?)
Hmm I don't disagree with the requirements, but I'm not super clear on which of them is violated now?
Oh I realize now why the != 'Dataset is used instead of just calling them -- this may very well not need this PR I think. Apologies!
No worries :-) Maybe you could update the test cases if you think that would make it more clear?
This is now fixed.