core [Epic] Automated Data Science bootstrap from curated content sets

Problem statement

As a Data Scientist, I want a service that provides me an easy mechanism to bootstrap a new Data Science project, starting from a curated software stack that is appropriate for my project’s goals and available in a shared environment, so that I can quickly start working on the Data Science project tasks without having to invest time in preparing a working environment, and I can be confident that the project is reproducible and maintainable.

High-level Goals

When starting a new Data Science project from scratch, the user interacts with a git forge to obtain a git repository populated from a relevant curated software stack. Bots keep the repository up to date with recommendations and make the project readily available for starting work on the Data Science tasks.

This involves:

A catalog of curated software stacks (currently we have predictable stacks for Image Processing, Computer Vision and Natural Language Processing)
Template repositories that can be used to bootstrap the DS project
A "bootstrap" command for the bot
Pipelines that create a working build of the project content
(optional) an online Open Data Hub environment that hosts a running version of the project

Proposal description

Phase 1

As a Data Scientist, I want to be able to bootstrap a new GitHub repository from an existing template that contains a curated software stack that is relevant to my project.

User is pointed to the relevant template repository
User initializes a repo from the template
The user's new repo contains a ready-to-use software stack with clearly documented next steps
User installs Kebechet in the repo so that it receives automated PRs in the future with update recommendations

Phase 1.5 is: automate phase 1 with a script.
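
A minimal sketch of what such a script could look like, assuming the GitHub REST endpoint for creating a repository from a template (`POST /repos/{template_owner}/{template_repo}/generate`) and assuming the chosen stack repository has been marked as a template. The helper name `build_generate_request` and the example owner/repo names are hypothetical. The sketch only builds the request; sending it and installing Kebechet on the new repo remain separate steps.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_generate_request(template_owner, template_repo, new_owner, new_name,
                           token, private=False):
    """Build (but do not send) the request that creates a repo from a template.

    All argument values are supplied by the caller; nothing here is specific
    to Thoth or Kebechet.
    """
    url = f"{GITHUB_API}/repos/{template_owner}/{template_repo}/generate"
    payload = {"owner": new_owner, "name": new_name, "private": private}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Usage (needs a token with permission to create repositories):
#   req = build_generate_request("thoth-station", "ps-nlp",
#                                "my-user", "my-nlp-project", token)
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["html_url"])
```

Keeping the request construction separate from the network call makes the script easy to dry-run before it touches a real account.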

Phase 2

As a Data Scientist, I want to open an issue such as "please create an Image Processing notebook" on GitHub that triggers the Thoth bot to start populating my repository:

User initializes an empty repo and installs Kebechet
User opens an issue in the repo, e.g. "New content set"
Bot bootstraps the repo
Bot kicks off the Bring-Your-Own-Notebook workflow
User enters the notebook spawner on ODH@op1st, sees the spawnable notebook image, and can start it
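
To illustrate the first bot step, here is a rough sketch of how an issue title could be mapped to one of the curated stacks. The `CATALOG` contents, the slug names and the `select_stack` helper are assumptions made for illustration; this is not how Kebechet currently handles issues.

```python
from typing import Optional

# Hypothetical catalog: human-readable stack name -> stack identifier.
CATALOG = {
    "image processing": "image-processing",
    "computer vision": "computer-vision",
    "natural language processing": "nlp",
}


def select_stack(issue_title: str) -> Optional[str]:
    """Return the stack identifier named in the issue title.

    Returns None when no stack, or more than one stack, is mentioned, so the
    bot can ask the user to clarify instead of guessing.
    """
    normalized = " ".join(issue_title.lower().split())
    matches = [slug for name, slug in CATALOG.items() if name in normalized]
    return matches[0] if len(matches) == 1 else None
```

With this sketch, a title like "please create an Image Processing notebook" resolves to a single stack, while a generic title such as "New content set" resolves to nothing and would need a follow-up question from the bot.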

Alternatives

User manually doing each step

Additional context

Acceptance Criteria

- [ ] A service entry point / welcome page provides:
  - [ ] A catalog of curated software stacks
  - [ ] Clear and concise instructions on how to use the service
  - [ ] Additional documentation and references to the components involved (Thoth advise, pipelines, byon/odh...)
- [ ] Template repositories containing the curated software stacks:
  - [ ] https://github.com/thoth-station/ps-nlp/issues/154
- [ ] Tooling exists to streamline the creation of the new repo from the templates
- [ ] A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content
- [ ] Build pipelines create a working build of the new project content
- [ ] (optional) An online Open Data Hub environment hosts a running version of the project

May 03 '22 11:05 codificat

/kind key-result

May 03 '22 11:05 codificat

We have discussed an approach that would reuse logic for template projects. Let's sync if we want to develop and maintain this type of logic in Kebechet.

May 03 '22 15:05 fridex

> A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content

I suggest making this the first milestone. It's ok to assume the user knows which stacks are available. So that

I start with an empty repo
I open an issue and ask the bot to create a PR for e.g. the Image Recognition Stack
I can merge the PR and have a configured repo, which receives update recommendations via PRs in the future

does that make sense?

May 04 '22 08:05 durandom

/sig user-experience

May 05 '22 08:05 goern

> A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content
>
> I suggest making this the first milestone.

There is a prerequisite to this, which is that the stack content is readily usable. This fits with Frido's comment about template projects.

> So that
>
> I start with an empty repo

.. and install the bot.

> I open an issue and ask the bot to create a PR for e.g. the Image Recognition Stack
>
> I can merge the PR and have a configured repo,

These 3 steps can potentially be reduced to one by just using the template project logic.

As this is a prerequisite anyway, I suggest we make that the first milestone. The bot automation can be a follow-up.

I updated the description a bit to reflect that, with the template logic being "phase 1".

Makes sense?

May 05 '22 10:05 codificat

Not sure how you plan to implement this, but it sounds like it would require the addition of an ever-growing set of template repos (is that right?). Have you considered using a cookie-cutter [1] like repo that would serve as a single repo, with dynamic options that can be applied on new repo creation based on the user's specific needs?

You can also look at [2] our ds cookie cutter repo for a very simple example.

[1] https://github.com/cookiecutter/cookiecutter
[2] https://github.com/aicoe-aiops/cookiecutter-data-science

May 12 '22 17:05 MichaelClifford

> Not sure how you plan to implement this, but it sounds like it would require the addition of an ever-growing set of template repos (is that right?).

Correct, that was the idea: start initially with the 3 "predictable stacks" we have been working on, but eventually offer more options.

> Have you considered using a cookie-cutter [1] like repo

I saw cookie-cutter and the work you are doing with it, and I was planning to look closer at it, starting with one of the repos (see mention of cookie-cutter/your template in https://github.com/thoth-station/ps-nlp/issues/154 as an option to explore).

But I was still thinking of separate repos. Honestly, this did not occur to me:

> that would serve as a single repo, with dynamic options that can be applied on new repo creation based on the user's specific needs?

Thanks for the suggestion! I will look closer.

An initial question that comes to mind, though: wouldn't that single repo become too big/complex? E.g. the NLP stack alone already has 4 overlays. One goal of this functionality is to be simple and easy to understand: it is meant to bootstrap/get started, and I am wondering if we would potentially be over-complicating the starting point.

May 17 '22 21:05 codificat

> wouldn't that single repo become too big/complex? e.g. the NLP stack alone already has 4 overlays.

It's certainly a trade-off to consider: managing one complex repo vs. the complexity of managing multiple simple repos. Again, it depends on how you plan to implement this. I was just presenting a possible suggestion/alternative.

Would it be as or more complex than https://github.com/operate-first/apps ?

May 18 '22 15:05 MichaelClifford

> It's certainly a trade-off to consider: managing one complex repo vs. the complexity of managing multiple simple repos. Again, it depends on how you plan to implement this. I was just presenting a possible suggestion/alternative.

Yes, and thanks again for the suggestion, it is being considered.

> Would it be as or more complex than https://github.com/operate-first/apps ?

It would not be as complex as that one, no.

/milestone OKR review Q2 2022

May 24 '22 13:05 codificat

/milestone OKR review Q2 2022

May 30 '22 15:05 codificat

/triage accepted
/lifecycle active
/assign

May 31 '22 11:05 codificat

/remove-lifecycle active

as focus this quarter is in a different KR

Aug 23 '22 13:08 codificat