papermill
schedule run notebook example
I need to schedule my notebook scripts to run on the first day of every month. I tried to follow the instructions on parameters but still could not work it out. How should I set up the parameters in my case? The provided example only mentions "alpha" and "ratio". What do they mean? Do I need to stick with these two variables for scheduling, and how do I make them represent "first day of every month"?
Thank you for your time.
So papermill doesn't do scheduling by itself. Instead think of it as a tool for executing notebooks that's easy to pass information into.
In the provided examples, alpha and ratio are just the names of the inputs for that situation. You can pass any parameter name with any value into the notebook. Say you wanted to execute a notebook and pass the current date into it. You might call (assuming you're on Linux or Mac):
papermill my_notebook.ipynb result.ipynb -p today `date +'%Y%m%d'`
This would inject a variable called "today" into your notebook with a value of 20190226 (as of writing this post).
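If the shell's `date` substitution isn't available (on Windows, for example), the same call can be made from Python. This is only a minimal sketch, assuming papermill is installed; `build_parameters` and `run_notebook` are names made up for illustration, not part of papermill.

```python
import datetime


def build_parameters(today=None):
    """Build the parameters dict that papermill injects into the notebook."""
    today = today or datetime.date.today()
    # Same YYYYMMDD format as the shell example, kept as a string here.
    return {"today": today.strftime("%Y%m%d")}


def run_notebook(input_path="my_notebook.ipynb", output_path="result.ipynb"):
    """Execute the notebook with today's date passed in as 'today'."""
    # Imported here so build_parameters() works without papermill installed.
    import papermill as pm

    pm.execute_notebook(input_path, output_path, parameters=build_parameters())
```

Calling `run_notebook()` from a small script gives you something any scheduler can launch with the Python interpreter.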
To schedule this execution you can try following these directions on using crontab, which show how to run the command above on a schedule. To run at midnight on the first day of every month you'd add this to your crontab (note that % has to be escaped as \% inside a crontab entry, and there is no space after the +):
0 0 1 * * papermill my_notebook.ipynb result.ipynb -p today `date +\%Y\%m\%d`
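In case the five fields in that crontab line look cryptic, they are minute, hour, day of month, month, and day of week. Here is a rough sketch of how such an expression is matched against a point in time. `cron_matches` is a made-up helper that only understands plain numbers and `*`, so it is far simpler than real cron.

```python
from datetime import datetime


def cron_matches(expr, when):
    """Check a datetime against a five-field cron expression.

    Simplified on purpose: only plain numbers and '*' are supported,
    and all five fields must match (real cron has more rules).
    """
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    # cron counts Sunday as 0, while isoweekday() counts Sunday as 7.
    actual = (when.minute, when.hour, when.day, when.month, when.isoweekday() % 7)
    return all(p == "*" or int(p) == a for p, a in zip(parts, actual))


# Midnight on the first of the month matches; any other day does not.
first_of_march = cron_matches("0 0 1 * *", datetime(2019, 3, 1, 0, 0))
second_of_march = cron_matches("0 0 1 * *", datetime(2019, 3, 2, 0, 0))
```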
Hope that helps!
Thank you for your comments! It makes more sense now. I'm using a Windows laptop, and I heard that crontab is not available on Windows. Could you suggest any other method?
https://stackoverflow.com/questions/132971/what-is-the-windows-version-of-cron links a few options depending on your OS version.
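One of those options is the built-in Windows Task Scheduler, which can be driven from the command line with `schtasks`. As a hedged sketch (untested against your setup), the snippet below builds a `schtasks` call that registers a monthly task for a Python script that runs the notebook. The task name and script path are hypothetical, and `build_schtasks_command` is a made-up helper.

```python
import sys


def build_schtasks_command(task_name, script_path, python_exe=None):
    """Build a schtasks call that runs a script at 00:00 on day 1 of each month."""
    python_exe = python_exe or sys.executable
    # Quote both paths so spaces in them survive inside the /TR value.
    task_run = f'"{python_exe}" "{script_path}"'
    return [
        "schtasks", "/Create",
        "/TN", task_name,   # name shown in Task Scheduler
        "/TR", task_run,    # command the task runs
        "/SC", "MONTHLY",   # schedule type
        "/D", "1",          # day of the month
        "/ST", "00:00",     # start time
    ]
```

On Windows you could pass the returned list to `subprocess.run(cmd, check=True)` once, or create the same task by hand in the Task Scheduler GUI.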
I use a combination of Apache Airflow (https://airflow.apache.org/) and Papermill for very complex scheduled tasks, and it works REALLY well. You'll need to write your own handler. An example could be:
import os
from datetime import datetime, timedelta

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def execute_python_notebook_task(**context):
    # With provide_context=True (Airflow 1.x), op_kwargs arrive merged into the context.
    notebook_path = context['notebook_path']
    out_path = context['out_path']
    out_dir = os.path.dirname(out_path)
    statement_parameters = context.get('statement_parameters')

    # Create the output directory if needed; out_dir is '' for a bare filename.
    if out_dir and not os.path.exists(out_dir):
        os.makedirs(out_dir)

    # Parameters may be a dict or a callable that builds one from the context.
    if callable(statement_parameters):
        statement_parameters = statement_parameters(context)

    pm.execute_notebook(
        notebook_path,
        out_path,
        parameters=statement_parameters
    )


# Midnight seven days ago, used as the DAG's start date.
seven_days_ago = datetime.combine(
    datetime.today() - timedelta(7),
    datetime.min.time()
)

default_args = {
    'owner': 'airflow',
    'start_date': seven_days_ago,
    'provide_context': True,
}

dag_name = 'runnin_notebooks_yo'
schedule_interval = '@monthly'  # midnight on the first day of each month

with DAG(dag_name, default_args=default_args, schedule_interval=schedule_interval) as dag:
    run_some_notebook_task = PythonOperator(
        task_id='run_some_notebook_task',
        python_callable=execute_python_notebook_task,
        op_kwargs={
            'notebook_path': 'path_to_some_notebook.ipynb',
            'out_path': 'path_to_some_notebook.out.ipynb',
            'statement_parameters': {
                'parameter_1': 'some_value'
            }
        }
    )
Please note that Airflow is a pretty full-featured tool that includes running branching dependencies of tasks. It may be overkill for what you want, but it is a pretty good tool for handling this sort of scheduling.
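The handler above also accepts a callable for 'statement_parameters', which lets the parameters depend on the run itself. A small sketch of what such a callable might look like follows. `monthly_parameters` is a hypothetical name, and it assumes the Airflow 1.x task context exposes the run's date under 'execution_date'.

```python
from datetime import datetime


def monthly_parameters(context):
    """Build notebook parameters from the Airflow task context."""
    # Assumption: the context carries the run's logical date as 'execution_date'.
    run_date = context["execution_date"]
    return {"today": run_date.strftime("%Y%m%d")}


# Stand-in context for illustration; Airflow supplies the real one at run time.
example = monthly_parameters({"execution_date": datetime(2019, 3, 1)})
```

You would then pass 'statement_parameters': monthly_parameters in op_kwargs instead of a dict. Keep in mind that Airflow's execution_date marks the start of the schedule interval, so a run that fires on the first of a month carries the previous month's date.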
@mbrio @otterotter408 If I'm not mistaken, Apache Airflow is a pain to install on Windows.
🤷‍♂️ https://stackoverflow.com/questions/32378494/how-to-run-airflow-on-windows
I can attest that the bash on windows approach works quite well for 99% of tasks (though I haven't tried airflow explicitly with this) :D
Airflow now has a PapermillOperator :) See the Airflow Papermill Operator documentation.
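For anyone landing here later, this is roughly how the operator might replace a hand-written handler. Treat it as a sketch under assumptions: it targets Airflow 2.x with the apache-airflow-providers-papermill package installed (older and newer releases may use a different import path or DAG arguments), and `papermill_task_kwargs` and `build_dag` are made-up helper names.

```python
from datetime import datetime


def papermill_task_kwargs(notebook_path, out_path, parameters=None):
    """Collect the keyword arguments for a PapermillOperator task."""
    return {
        "task_id": "run_some_notebook_task",
        "input_nb": notebook_path,
        "output_nb": out_path,
        "parameters": parameters or {},
    }


def build_dag():
    """Create a monthly DAG with a single PapermillOperator task."""
    # Imported lazily so this module loads even where Airflow is not installed.
    from airflow import DAG
    from airflow.providers.papermill.operators.papermill import PapermillOperator

    with DAG(
        "runnin_notebooks_yo",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        PapermillOperator(
            **papermill_task_kwargs(
                "path_to_some_notebook.ipynb",
                "path_to_some_notebook.out.ipynb",
                {"parameter_1": "some_value"},
            )
        )
    return dag
```

In a real DAG file you would add `dag = build_dag()` at module level so Airflow can discover it.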