schedule run notebook example

Open otterotter408 opened this issue 6 years ago • 8 comments
I need to schedule my notebook scripts to run on the first day of every month. I tried to follow the instructions on parameters but still could not work it out. How should I set up the parameters in my case? The example provided only mentions "alpha" and "ratio". What do they mean? Do I need to stick with these two variables for scheduling, and how do I make them represent "first day of every month"?

Thank you for your time.

otterotter408, Feb 26 '19 21:02

So papermill doesn't do scheduling by itself. Instead, think of it as a tool for executing notebooks that's easy to pass information into.

In the provided examples, alpha and ratio are just the names of the inputs for that particular notebook. You can pass any parameter name with any value into the notebook. Say you wanted to execute a notebook and pass the current date into it. You might call (assuming you're on Linux or Mac):

papermill my_notebook.ipynb result.ipynb -p today `date +'%Y%m%d'`

This would inject a variable called "today" into your notebook with a value of 20190226 (as of writing this post).
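
The same injection can also be done from Python through papermill's execute_notebook function. The following is a minimal sketch, assuming papermill is installed; papermill looks for a cell tagged "parameters" in the notebook and injects the values after it. The helper names build_parameters and run_notebook are made up for illustration and are not part of papermill.

```python
from datetime import date


def build_parameters(run_date=None):
    """Return the parameters dict to inject; 'today' mirrors the -p flag above."""
    run_date = run_date or date.today()
    # Formatted as YYYYMMDD, e.g. "20190226". It is passed as a string here.
    return {"today": run_date.strftime("%Y%m%d")}


def run_notebook(input_path="my_notebook.ipynb", output_path="result.ipynb"):
    """Execute the notebook with the date injected (hypothetical helper)."""
    # Imported lazily so build_parameters stays usable without papermill installed.
    import papermill as pm

    pm.execute_notebook(input_path, output_path, parameters=build_parameters())
```

Calling run_notebook() from a script gives any scheduler a single Python entry point to invoke.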

To schedule this execution you can try following these directions on using crontab. This will show you how to run the script above on a schedule. Note that cron treats an unescaped % as a newline, so the percent signs have to be backslash-escaped inside a crontab entry. To run on the first day of every month you'd add this to your crontab:

0 0 1 * * papermill my_notebook.ipynb result.ipynb -p today `date +\%Y\%m\%d`
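
The five leading fields of that entry are minute, hour, day of month, month, and day of week, so 0 0 1 * * means "minute 0, hour 0, on day 1, in any month, on any weekday". As a rough illustration of when such an entry fires, here is a small sketch that computes the next matching time; the function name next_first_of_month is hypothetical, and the code ignores time zones.

```python
from datetime import datetime


def next_first_of_month(now):
    """Return the next midnight on day 1 of a month, strictly after `now`."""
    if now.month == 12:
        # December rolls over into January of the following year.
        return datetime(now.year + 1, 1, 1)
    return datetime(now.year, now.month + 1, 1)


# Example: the post above was written on Feb 26, 2019.
upcoming = next_first_of_month(datetime(2019, 2, 26, 21, 2))
```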

Hope that helps!

MSeal, Feb 27 '19 03:02

Thank you for your comments! It makes more sense now. I'm using a Windows laptop, and I heard that crontab is not available on Windows. Could you suggest another method?

otterotter408, Feb 28 '19 23:02

https://stackoverflow.com/questions/132971/what-is-the-windows-version-of-cron links a few options depending on your OS version.
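
One option on Windows is the built-in Task Scheduler, which can be driven from the command line with schtasks. The sketch below only builds the argument list for a monthly task; the task name papermill_monthly and the helper build_schtasks_command are assumptions for illustration, and the flags should be checked against the schtasks documentation for your Windows version before use.

```python
def build_schtasks_command(task_name, command, day=1, start_time="00:00"):
    """Build a `schtasks /Create` argument list for a task that runs monthly."""
    return [
        "schtasks", "/Create",
        "/SC", "MONTHLY",     # schedule type
        "/D", str(day),       # day of the month to run on
        "/ST", start_time,    # start time in 24-hour HH:mm form
        "/TN", task_name,     # name shown in Task Scheduler (hypothetical)
        "/TR", command,       # the command line the task runs
    ]


args = build_schtasks_command(
    "papermill_monthly",
    "papermill my_notebook.ipynb result.ipynb",
)
# On Windows this list could be passed to subprocess.run(args, check=True).
```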

MSeal, Mar 01 '19 00:03

I use a combination of Apache Airflow (https://airflow.apache.org/) and Papermill for very complex scheduled tasks, and it works REALLY well. You'll need to write your own handler; an example could be:

import os
from datetime import datetime, timedelta

import papermill as pm
from airflow import DAG
# Airflow 1.x import path; newer releases expose this as airflow.operators.python.
from airflow.operators.python_operator import PythonOperator


def execute_python_notebook_task(**context):
    """Run a notebook with papermill; paths and parameters arrive via op_kwargs."""
    notebook_path = context['notebook_path']
    out_path = context['out_path']
    out_dir = os.path.dirname(out_path)
    statement_parameters = context.get('statement_parameters')

    # dirname() is '' for a bare filename and os.makedirs('') raises, so guard it.
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Parameters may be a dict or a callable that builds one from the task context.
    if callable(statement_parameters):
        statement_parameters = statement_parameters(context)

    pm.execute_notebook(
        notebook_path,
        out_path,
        parameters=statement_parameters
    )


# Midnight, seven days before the DAG file is parsed. Airflow's docs recommend
# a fixed start_date; the rolling value is kept here as in the original post.
seven_days_ago = datetime.combine(
    datetime.today() - timedelta(7),
    datetime.min.time()
)

default_args = {
    'owner': 'airflow',
    'start_date': seven_days_ago,
    # Needed on Airflow 1.x so the task context is passed as keyword arguments.
    'provide_context': True,
}

dag_name = 'runnin_notebooks_yo'
schedule_interval = '@monthly'  # midnight on the first day of each month

with DAG(dag_name, default_args=default_args, schedule_interval=schedule_interval) as dag:
    run_some_notebook_task = PythonOperator(
        task_id='run_some_notebook_task',
        python_callable=execute_python_notebook_task,
        op_kwargs={
            'notebook_path': 'path_to_some_notebook.ipynb',
            'out_path': 'path_to_some_notebook.out.ipynb',
            'statement_parameters': {
                'parameter_1': 'some_value'
            }
        }
    )

Please note that Airflow is a pretty full-featured tool, which includes running branching dependencies of tasks. It may be overkill for what you want, but it is a pretty good tool for handling this sort of scheduling.
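
The handler above also accepts a callable for statement_parameters and calls it with the task context. As a sketch of how that could be used to pass a date into the notebook, the hypothetical function below reads the execution_date key, which I'm assuming is present in the context (as it is on Airflow 1.x with provide_context=True); the parameter name today is just an example.

```python
from datetime import datetime


def monthly_parameters(context):
    """Build papermill parameters from the Airflow task context (hypothetical helper)."""
    # Assumption: the context carries 'execution_date' as a datetime-like object.
    run_date = context["execution_date"]
    return {"today": run_date.strftime("%Y%m%d")}


# Stand-in context for illustration; Airflow would supply the real one.
example = monthly_parameters({"execution_date": datetime(2019, 3, 1)})
```

It would be wired in by setting 'statement_parameters': monthly_parameters in op_kwargs. Keep in mind that Airflow's execution date marks the start of the schedule interval, so it may not equal the wall-clock date on which the task actually runs.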

mbrio, Mar 13 '19 14:03

@mbrio @otterotter408 If I'm not mistaken, Apache Airflow is a pain to install on Windows.

pybokeh, Mar 17 '19 15:03

🤷‍♂️ https://stackoverflow.com/questions/32378494/how-to-run-airflow-on-windows

mbrio, Mar 20 '19 14:03

I can attest that the Bash on Windows approach works quite well for 99% of tasks (though I haven't tried Airflow explicitly with this) :D

MSeal, Mar 20 '19 20:03

Airflow now has a PapermillOperator :) See "Airflow Papermill Operator".

yosefbs, Sep 13 '21 09:09