kedro
kedro copied to clipboard
Render concrete parameter/return types for dataset `save`/`load`
Description
The parameter type for a dataset's save
is _DI
, and the return type for load
is _DO
. This doesn't make any sense to a user (and barely makes sense to me, as a Kedro developer, because I saw the typing PR go in and have that context). I'd rather it not display any types, if getting the correct concrete types is too difficult.

This would indeed be great to fix but I think only someone very brave should attempt it as I can see it being very frustrating and difficult to achieve... Do you have any idea how you could actually get Sphinx to do that without us having to manually override every single page of automatically generated docs for datasets? Possibly we can edit a higher-level rst template to do it?
To do: Try updating to the newest version of sphinx
and see if it has improved.
We are using a later version of Sphinx and this is still not resolved.
It may be something that @astrojuanlu has familiarity with but if it's a non-trivial fix I suggest we put it as a very low priority. Could we potentially have an explanation of what is going on somewhere so if people want to know what it means, they can search and find the reasoning?
Upgrading sphinx or sphinx-autodoc-typehints proved to create lots of headaches in the past. I agree the current result is less than ideal but I don't think this is trivial to fix at all. It might be easier to actually redesign the datasets...
It doesn't look much better on latest
with the spun-off kedro_datasets
https://docs.kedro.org/en/stable/kedro_datasets.pandas.CSVDataSet.html
The root cause is that the heavy lifting is done by _load
, which is a private method. Conversation about this is scattered around different issues (#1778, #1936, #2409 to name a few) but essentially I think the only sane way to fix this is to come up with a design that favours composition over inheritance, something that looks more similar to the proposal in https://github.com/kedro-org/kedro/issues/1936#issue-1410411315.
I think it's easier to just wait on a redesign of our catalog and datasets #1778.
Desired output
We can redefine load
and save
in each class to produce our desired result:
from pandas.util._decorators import doc # We could roll our own, simpler version
class LambdaDataset(AbstractDataset):
...
@doc(AbstractDataset.save.__doc__)
def save(self, data: int) -> None:
super().save(data)
@doc(AbstractDataset.load.__doc__)
def load(self) -> pd.DataFrame:
return super().load()
Note that clicking through to the code on the docs will show that generated code, so maybe we should make it explicit that this is a generated function in the comments:
Possible implementations
Now, the question is—how do we accomplish this for all datasets? It is possible to either:
- Actually add the
load
andsave
methods to each dataset implementation. There is probably a negligible performance hit from the added function call (since I/O is almost certainly the dominant part of any reasonable dataset). It's not the prettiest from a code perspective, though, where you'd rather rely on inheritance. - You can generate the additional methods just for docs generation.
- Is there another way to modify classes parsed by Sphinx before docs generation?
Notes
- I tried adding a
.pyi
file, but that didn't seem to have any effect. Can't figure out for certain whethersphinx_autodoc_typehints
looks at them (I don't think so). - https://github.com/tox-dev/sphinx-autodoc-typehints/issues/296 fixed
typing.overload
support, but is only available fromsphinx_autodoc_typehints==1.21.2
. While we probably shouldn't hard-pinsphinx_autodoc_typehints
inkedro-sphinx-theme
, and I personally am not sure what regression was the issue/if it may have been fixed in a future version (sphinx_autodoc_typehints
is on 2.x now), the specific issue resolved retyping.overload
(duplicate entries) does not seem to be the problem here. - Upon further investigation, maybe
typing.overload
doesn't have any value here to begin with, as you must specify the implementation after thetyping.overload
. At that point, we may as well just provide the implementation only in our case.
We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:
- Given that this would mainly be the presentation layer, maybe it would be worth exploring a Sphinx extension instead.
- However, creating Sphinx extensions can be an interesting exercise.
- @deepyaman wasn't particularly motivated to pursue this approach but didn't close the door on somebody else doing it.
- This is probably something that wouldn't be useful to the broader community, given that this is a very specific situation caused by the design.
- We discussed about the datasets design (xref #1778)
- Of course we acknowledged that it will take us ~months to get there, so it would be nice to have an interim solution
- We started discussing whether we could get rid of the custom error handling in
AbstractDataSet
(see below) given that it's directly affecting this issue and it's making it difficult for users to debug their dataset errors https://github.com/kedro-org/kedro/issues/1936#issuecomment-1727172650 but we didn't have enough time to discuss
https://github.com/kedro-org/kedro/blob/c8a0e22464f2d229e26ec7fc138436c8863e3dcc/kedro/io/core.py#L190-L202
We discussed this on tech design today (2024-05-08). While there weren't clear arguments against this solution, we discussed alternative approaches:
Given that this would mainly be the presentation layer, maybe it would be worth exploring a Sphinx extension instead.
- However, creating Sphinx extensions can be an interesting exercise.
- @deepyaman wasn't particularly motivated to pursue this approach but didn't close the door on somebody else doing it.
- This is probably something that wouldn't be useful to the broader community, given that this is a very specific situation caused by the design.
We discussed about the datasets design (xref Re-design io.core and io.data_catalog #1778)
- Of course we acknowledged that it will take us ~months to get there, so it would be nice to have an interim solution
- We started discussing whether we could get rid of the custom error handling in
AbstractDataSet
(see below) given that it's directly affecting this issue and it's making it difficult for users to debug their dataset errors Easier CustomDataset Creation #1936 (comment) but we didn't have enough time to discusshttps://github.com/kedro-org/kedro/blob/c8a0e22464f2d229e26ec7fc138436c8863e3dcc/kedro/io/core.py#L190-L202
Thanks @astrojuanlu! Just copying in my summary, but I think you covered it:
It sounds like overriding
load
/save
may be workable, there’s a potential weekend project if somebody wants to write their own Sphinx plugin, but maybe the first followup is to see whether we can get rid of_load
and_save
altogether?
I have some ideas on the latter, and I will investigate this.