Scheduling Notebooks
Problem
- How can I quickly go from experimentation (`.ipynb`) to production (typically `.py`)?
Current Solution
The prevailing method of "productionizing" notebooks is:
- Convert the notebook to a Python script (see the sketch below)
- Clean up the script, write some tests, get a code review done
- Set up a cloud machine and install all the libraries (or dockerize the script)
- Run the script manually or set up a crontab
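For reference, step 1 above is typically a one-liner with nbconvert's Python API (a minimal sketch; the notebook filename is hypothetical):

```python
# Convert a notebook to a plain Python script with nbconvert.
# "analysis.ipynb" is a hypothetical filename.
from nbconvert import PythonExporter

exporter = PythonExporter()
source, _resources = exporter.from_filename("analysis.ipynb")

with open("analysis.py", "w") as f:
    f.write(source)
```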
Is this really efficient?
Challenging the Status Quo
What if we run notebooks directly for our production workflows? Here are some benefits:
- Rich output for each execution (the notebook itself!)
- Quickly go from experimentation to production; no time spent extracting code from `.ipynb`
- Failed workflows are easy to debug (thanks to the rich notebook output)
Why do we really need to convert notebooks to Python scripts? Here are a few common objections (I'd love to learn more in comments):
- Code review - We can review notebooks directly with ReviewNB & nbdime (`.py` is not necessary).
- Testing - We can directly write tests for notebook code with Treon and a few other tools (`.py` is not necessary here either).
- Code reuse - This is a legit reason. You should definitely convert notebook code into libraries whenever possible. It makes reuse super easy and keeps the notebook readable. But we don't need to convert the entire notebook into a script, do we? The final execution can easily be running a notebook that imports the libraries we created (see the sketch below).
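To make the code-reuse point concrete, here is a minimal sketch of what such a "thin" production notebook cell could look like. The package `mylib` and its helpers are hypothetical names, not part of any real library; the `assert` also illustrates how Treon-style testing works, since Treon re-executes the notebook and fails on any cell that raises.

```python
# Hypothetical production notebook cell: the real logic lives in a library,
# the notebook only orchestrates it. `mylib`, `load_dataset`, and
# `train_model` are assumed names for illustration.
from mylib.training import load_dataset, train_model

data = load_dataset("daily-extract.parquet")
model, metrics = train_model(data, n_estimators=100)

# A plain assert doubles as a test: Treon re-runs the notebook and
# reports any cell that raises.
assert metrics["auc"] > 0.8, f"Model quality regressed: {metrics['auc']:.3f}"
```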
Proposed Solution
- You select a notebook from a GitHub repo and set a schedule for it to run (once/daily/weekly etc.).
- You select the instance type (memory, vCPU) for execution.
- You can specify different parameters for each run via Papermill (see the sketch after this list).
- ReviewNB executes this notebook on your specified schedule & preserves the result of each run (as an executed notebook).
- ReviewNB supports notebook workflows (parallel executions for different parameters, the result of one notebook feeding into the next, etc.).
- For the environment, we use stable versions of commonly used DS libraries. Users can specify their own environment as well (via a Dockerfile).
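For context, the Papermill-based parameterization and chaining mentioned above would look roughly like this (a minimal sketch; notebook names and parameter values are hypothetical, not part of the proposal):

```python
# Parameterized notebook execution with Papermill; notebook names and
# parameter values are hypothetical.
import papermill as pm

# Run the same notebook once per region, each with its own parameters.
# Each executed copy is preserved as the rich output of that run.
for region in ["us-east-1", "eu-west-1"]:
    pm.execute_notebook(
        "etl.ipynb",
        f"runs/etl-{region}.ipynb",
        parameters={"run_date": "2024-01-01", "region": region},
    )

# A downstream notebook that consumes the ETL outputs: the result of
# one notebook feeding into the next.
pm.execute_notebook(
    "report.ipynb",
    "runs/report-2024-01-01.ipynb",
    parameters={"run_date": "2024-01-01"},
)
```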
FAQ
- Can we run notebooks on our own hardware? Absolutely. You can self-host ReviewNB & hook it up to your own AWS/GCP account to execute notebooks on your own machines.
- How will I specify sensitive data (e.g. DB credentials) required for execution? ReviewNB provides a prompt to set any sensitive data as environment variables that are available to the notebook at runtime (see the sketch below).
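For illustration, reading such a credential inside the notebook is just a standard environment-variable lookup (the variable name `DB_PASSWORD` is hypothetical):

```python
# Read a credential injected as an environment variable at runtime.
# "DB_PASSWORD" is a hypothetical variable name.
import os

db_password = os.environ["DB_PASSWORD"]  # raises KeyError if not set
# ...use db_password to open the DB connection; never hard-code secrets.
```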
Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature or not. I also welcome additional questions/comments/discussion on the issue.
This would be an amazing feature. We use notebooks in more than one capacity in our organization:
Data Science:
- Model creation - probably won't be run on the cloud; they get translated to pure Python first
- Validation - we need to run these every time we update our models
DevOps:
One-off scripts (for DB migrations, backfilling, or any emergency ops) get written as notebooks into a /playbooks directory; they are reviewed on GitHub and then run locally right now. It would be very valuable to run these from a preset environment.
For any of these use cases, the permission and security model would dictate if we could use it as a part of our workflow.
Thank you @srossross
> For any of these use cases, the permission and security model would dictate if we could use it as a part of our workflow.
I'm thinking of relying on GitHub permissions. E.g. All users who have read access on a private GitHub repository can also see all periodic jobs for that repository. All users who can write to that repository can also edit/create jobs for that repository. Would this work or do you need a separate permission system for jobs?
> Validation - we need to run these every time we update our models

How are you running these currently (manually or via automated jobs)? And where are you running them (locally or in the cloud)?
> Model creation - probably won't be run on the cloud; they get translated to pure Python first
Just curious, why not run these as notebooks as well? Are they not suitable for the notebook format?
I think fast.ai's nbdev template solves these problems.