kedro
kedro copied to clipboard
How to make the documentations works for `kedro-datasets`?
Introduction
This is part of the effort of splitting out kedro.extras.datasets
to kedro-datasets
Background
How will docs work for kedro-datasets? We will need to generate documentation (at least the dataset part) from the kedro-datasets repo. What will be the CI/CD process to generate it?
There are also more subtle questions like:
- What does
latest
orstable
means when we have two repositories in the doc?
Potential Solutions
- Having a dedicated doc repository that add
kedro
andkedro-datasets
into a monorepo - http://fmarco76.github.io/git%20and%20related%20services/readthedocs. This is quite complicated but seems to be the more obvious choice when I Google around. - Split the doc into two ReadTheDocs projects - may be easier technically, but not ideal for user experience and diverge our visits into two places.
Both of these solutions are far from ideal, is there any popular libraries doing similar things?
Maybe we should take the opportunity to re-evaluate whether there's a better option than sphinx for us. Last time I checked it was tricky to find an alternative that did both markdown docs + API docs, but maybe worth investigating again. The current sphinx process is quite painful already, and I'm a bit worried this will make it even worse.
kedro datasets API documentation is definitely important though. I'd guess it's the most viewed part of our API docs by some margin.
My initial thought would be to keep docs in the kedro
repo. make build-docs
would then:
- either build docs in
kedro-datasets
and copy over the built version intokedro
built docs - or copy over the datasets from
kedro-datasets
and then build them onkedro
But I think both options will be a bit fiddly to get working (e.g. so sphinx links work correctly) 😬
Split the doc into two ReadTheDocs projects - may be easier technically, but not ideal for user experience and diverge our visits into two places.
Agree this would be technically easier but also not such a smooth user experience. I guess it wouldn't be so bad if we link to those docs from the kedro ones very clearly though. it would be easier to resolve the stable
vs. latest
question this way also.
@deepyaman any ideas here? You like fiddling round with docs 😀
Note this is old. See https://github.com/kedro-org/kedro/issues/1651#issuecomment-1240692700 for the latest scheme.
This was discussed in technical design on 13 July and the plan is as follows.
kedro-datasets
and kedro
docs should remain together on RTD as they do now. The only difference is that the API docs for kedro-datsets
will appear under kedro.datasets
rather than kedro.extras.datasets
. All the document-building takes place on the kedro
repo.
To achieve this we need the following flow when docs are built on kedro
:
- Copy the
kedro-datasets
files into adatasets
folder inkedro
(maybe clone just the relevant branch of the repo?) -
make build-docs
on the kedro repo should then work as it does now to generate API docs for both kedro core and kedro datasets - the
kedro-datasets
files should not remain on the kedro repo - they're just temporarily copied across for docs generation
The tricky part here is ensuring that the different latest
, stable
, etc. versions show the right thing on RTD.
Ideally this is what we should have:
-
latest
build corresponds tokedro
branchmain
andkedro-datasets
branchmain
-
stable
build corresponds tokedro
latest release andkedro-datasets
latest release - version tags like 0.18.1 just point to a fixed
stable
build. e.g. the version 0.18.1 would point tokedro
version 0.18.1 and whatever the latestkedro-datasets
release was when 0.18.1 was released. Version tags always give akedro
version; there are no tags forkedro-datasets
version.
Note this is still a bit confusing because we could release kedro-datasets
2.0 while kedro
points to kedro-datasets~=1.0
, and the docs will then show the latest kedro-datasets
2.0 documentation even though that's not what pip install kedro[...]
would be installing. I don't see any way around this though which doesn't cause other points of confusion.
To achieve this:
- when
kedro
has a commit tomain
, it re-builds the kedro docs usingkedro
branchmain
andkedro-datasets
branchmain
. This updates RTDlatest
build - when
kedro
has a release, it re-builds the kedro docs usingkedro
branchmain
(which is the same as the latest release) andkedro-datasets
latest release (notmain
). This updates RTDstable
and${KEDRO_VERSION}
build - when
kedro-datasets
has a commit tomain
, it re-builds the kedro docs usingkedro
branchmain
andkedro-datasets
branchmain
. This updates RTDlatest
build - when
kedro-datasets
has a release, it re-builds the kedro docs usingkedro
branch latest release (notmain
) andkedro-datasets
branchmain
(which is the same as the latest release). This updates RTDstable
and${KEDRO_VERSION}
build
For the bottom two points where kedro-datasets
needs to trigger a kedro
CI job, we have something similar setup to trigger test jobs on kedro-viz
whenever there's a new commit to main
in kedro
, so check how that works to see how to do this.
It would be nice to have a note of which version of docs for kedro-datasets
is being used in the kedro docs. Probably the easiest way to do this is through an appropriate docstring in kedro.datasets
. This gets rendered in the API docs as a description:
Possible catch/source of confusion: are we going to have both kedro.extras.datasets
and kedro-datasets
documentation in the kedro docs at the same time? Or should we just not render the kedro.extras.datasets
docs as soon as we include the kedro-datasets
ones? Note we will probably want to add some redirects to fix broken links anyway.
The tricky part here is ensuring that the different
latest
,stable
, etc. versions show the right thing on RTD.Ideally this is what we should have:
latest
build corresponds tokedro
branchmain
andkedro-datasets
branchmain
stable
build corresponds tokedro
latest release andkedro-datasets
latest release- version tags like 0.18.1 just point to a fixed
stable
build. e.g. the version 0.18.1 would point tokedro
version 0.18.1 and whatever the latestkedro-datasets
release was when 0.18.1 was released. Version tags always give akedro
version; there are no tags forkedro-datasets
version.Note this is still a bit confusing because we could release
kedro-datasets
2.0 whilekedro
points tokedro-datasets~=1.0
, and the docs will then show the latestkedro-datasets
2.0 documentation even though that's not whatpip install kedro[...]
would be installing. I don't see any way around this though which doesn't cause other points of confusion.
How about:
-
latest
build corresponds tokedro
branchmain
and latestkedro-datasets
compatible with latest Kedro--branchmain
ofkedro-datasets
, as long as the version onmain
is not bumped such that it's no longer compatible with the bounds of latest Kedro, in which case it would fall back -
stable
build corresponds tokedro
latest release andkedro-datasets
latest release compatible withkedro
latest release
More complicated, but more accurate, if you assume the entrypoint for users is through Kedro, not Kedro-Datasets.
Yeah, agree that is the even more accurate 👍 I just thought it might be more complicated than it was worth (given my "ideal" solution above is already kind of complicated). e.g. I think you would have to do something like this for latest
build:
- parse
extras_require
inkedro
branchmain
to get the version specifier ofkedro-datasets
(or we could just write this somewhere less buried I guess?) - get
kedro-datasets
branchmain
current version, check if it's compatible with that version specifier (needssemver
library or something like that to do the comparison?) - if it is then use that branch; otherwise work out what is the latest version of
kedro-datasets
that satisfies the version specifier (needs a call to PyPI?) - use the branch of
kedro-datasets
that corresponds to that
... which all sounds possible. I just don't know if it's worth the extra effort (or maybe there's a much easier way of doing it that I missed here).
Ah, also I just remembered what I thought one of the points of confusion here would be: the documentation for main
branch of kedro-datasets
doesn't necessarily get rendered anywhere. e.g. we're on kedro-datasets
2.0 and kedro
still has kedro-datasets~=1.0
. We update the docstrings of kedro-datasets
, people can use the library but they can't see the documentation anywhere without looking at the source code. This is ok because maybe we just say people who update to the latest kedro-datasets
are advanced users who don't need the built documentation, but it's a bit annoying. I don't know whether it's overall more or less annoying than my "ideal" solution.
Important note. Using a namespace package should make stage 1 (copy the kedro-datasets files into a datasets folder in kedro) much easier: you just pip install -t
into the right directory, and by the wonders of namespace packages it should just work. Probably it won't work in Sphinx without the __init__.py
files though...
Updated scheme following https://github.com/kedro-org/kedro/issues/1758#issuecomment-1234505832.
kedro-datasets
and kedro
docs should remain together on RTD as they do now. The only difference is that the API docs for kedro-datsets
will appear under kedro.datasets
rather than kedro.extras.datasets
. All the document-building takes place on the kedro
repo.
To achieve this we need the following flow when docs are built on kedro
:
- Copy the
kedro-datasets
files into adatasets
folder inkedro
(maybe clone just the relevant branch of the repo? Orpip install --no-deps -t
into the right directory is maybe easier?) -
make build-docs
on the kedro repo should then work as it does now to generate API docs for both kedro core and kedro datasets - the
kedro-datasets
files should not remain on the kedro repo - they're just temporarily copied across for docs generation
The tricky part here is ensuring that the different latest
, stable
, etc. versions show the right thing on RTD.
Ideally this is what we should have:
-
latest
build corresponds tokedro
branchmain
andkedro-datasets
branchmain
-
stable
build corresponds tokedro
latest release andkedro-datasets
latest release - version tags like 0.18.1 just point to a fixed
stable
build. e.g. the version 0.18.1 would point tokedro
version 0.18.1 and whatever the latestkedro-datasets
release was when 0.18.1 was released. Version tags always give akedro
version; there are no tags forkedro-datasets
version.
To achieve this:
- when
kedro
has a commit tomain
, it re-builds the kedro docs usingkedro
branchmain
andkedro-datasets
branchmain
. This updates RTDlatest
build - when
kedro
has a release, it re-builds the kedro docs usingkedro
branchmain
(which is the same as the latest release) andkedro-datasets
latest release (notmain
). This updates RTDstable
and${KEDRO_VERSION}
build - when
kedro-datasets
has a commit tomain
, it re-builds the kedro docs usingkedro
branchmain
andkedro-datasets
branchmain
. This updates RTDlatest
build - when
kedro-datasets
has a release, it re-builds the kedro docs usingkedro
branch latest release (notmain
) andkedro-datasets
branchmain
(which is the same as the latest release). This updates RTDstable
and${KEDRO_VERSION}
build
For the bottom two points where kedro-datasets
needs to trigger a kedro
CI job, we have something similar setup to trigger test jobs on kedro-viz
whenever there's a new commit to main
in kedro
, so check how that works to see how to do this.
It would be nice to have a note of which version of docs for kedro-datasets
is being used in the kedro docs. Probably the easiest way to do this is through an appropriate docstring in kedro.datasets
. This gets rendered in the API docs as a description:
Possible catch/source of confusion: are we going to have both kedro.extras.datasets
and kedro-datasets
documentation in the kedro docs at the same time? Or should we just not render the kedro.extras.datasets
docs as soon as we include the kedro-datasets
ones? Note we will probably want to add some redirects to fix broken links anyway.
The above solution was discussed in Tech Design on 14/9 and everyone agreed that it sounds like a good plan. We should:
- Ensure that as much of the setup as possible is automated
- Do an experiment to see if we can "have a note of which version of docs for
kedro-datasets
is being used in the kedro docs." Above this is described as a nice to have, but in the discussion we agreed this is quite important and we should make an effort to make it work.
Related to #1497 - Adding this here so we don't forget to update it once we decide on how to make docs work for kedro-datasets
.
- There is a note on Guidelines for contributing developers which I'm moving to Custom datasets in PR #1991 which will need to be updated according to how we decide to deal with API documentation
There are two special considerations when contributing a dataset:
1. Add the dataset to `kedro.extras.datasets.rst` so it shows up in the API documentation.
2. Add the dataset to `static/jsonschema/kedro-catalog-X.json` for IDE validation.
Notes on how to copy kedro-datasets
to kedro
:
To copy kedro-datasets
main use this command:
pip install --no-deps -t TARGET_DIRECTORY git+https://github.com/kedro-org/kedro-plugins.git#subdirectory=kedro-datasets
To copy kedro-datasets
latest release use this command:
pip install --no-deps -t TARGET_DIRECTORY kedro-datasets
@SajidAlamQB For the kedro-datasets
release - I imagine it would be the same as Workflow 2?
@noklam I thought so too but I was also thinking we would use the kedro's release workflow for workflow 2 which wouldn't make sense to do when kedro-datasets
has a new release.
The current problem we are facing is that ReadTheDocs checks out Kedro
main
branch and build the docs directly. For this issue, we need to copy kedro-datasets
into kedro.datasets
I managed to do this on our CI, but that has no effect on RTD. CI has 2 ways of publishing to RTD, sync
and publish_kedro
jobs, and both just trigger the checkout by RTD from Kedro
branch main
.
So how are we meant to copy over kedro-datasets
so that ReadTheDocs can use it?
3 possible solutions:
Solution 1: Commit way: We can trigger a commit with kedro-datasets
through the CI whenever kedro
/kedro-datasets
has a release or update and remove it after with a following commit. We could do this on a separate branch or on main
. If we decide to have it on a separate branch we would need to update the ReadTheDocs admin panel to use that branch instead.
Solution 2: .readthedocs.yml
hooks: The YAML file for RTD provides hooks to inject some code at certain sequences during the RTD builds. But since it's only YAML no way to add conditional statements for complex logic to differentiate when it's a kedro
/kedro-datasets
release or commit.
Solution 3: sphinx conf.py
: we can add some logic in the conf.py
file to do this; the only issue again is how we would know if it's a kedro
/kedro-datasets
commit or release.
The above 3 solutions were discussed in the technical design session, and we have decided to:
-
Implement solution 2, use
.readthedocs.yml
hooks to copy over ONLY thelatest release
ofkedro-datasets
tokedro.datasets
just before the document generation step on RTD. This means that ourlatest
andstable
tags will be the same for thekedro-datasets
API docs. We agreed on this being a temporary solution as we will be moving away from RTD and Sphinx. We'll mitigate the issue of thekedro-datasets
docs falling behind by having frequent releases. -
We will copy just
latest
release ofkedro-datasets
on CI, and test that the docs still pass.
Completed in https://github.com/kedro-org/kedro/pull/2006