Sample repo for luigi tasks & config

Luigi Sample Repository

There are a lot of resources online explaining the use cases of Luigi, but few of them explain how to set up and configure it. This repository contains several Luigi tasks that demonstrate both the use cases and the setup.

Installation

The only package we need is luigi:

$ pip install luigi

Tasks

Each task is a unit of work. Tasks can be independent or can have dependencies.

  • Simple Hello World task - Simple example to show the structure of a Luigi task
  • Task with dependency - Example to show how to add dependencies to the current task
  • Task with mock target - For a task that has no real output but still needs to take part in a dependency chain
  • PySpark Task - Example PySpark task with a JDBC connection
Note:

Every task must define an output; otherwise Luigi cannot tell whether the task has completed. The output is essentially Luigi's bookkeeping for the status of a task.

If you remove the output() method from a task that other tasks depend on, Luigi never sees that task as complete and it will run indefinitely. So make sure every task defines output() and actually writes to it.

Running the tasks

To run any task, use the following command pattern:

$ luigi --module mymodule MyTask --local-scheduler

Example:

$ luigi --module dependant_task DependantTask --local-scheduler

Configuration

Luigi looks for config files in:

  • /etc/luigi/client.cfg
  • luigi.cfg in the current working directory
  • the file pointed to by the LUIGI_CONFIG_PATH environment variable

The most important part of the configuration is setting up a Spark or PySpark job. The Luigi config has [spark] and [pyspark] sections for specifying the extra JARs and drivers required to run the Spark job.
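
Luigi config files use INI syntax, so the shape of a [spark] section can be shown with Python's standard configparser. This is only a sketch: the option names are ones Luigi's Spark support is understood to read (worth checking against the Luigi configuration docs for your version), and every path is a hypothetical placeholder, not a value from this repository.

```python
import configparser

# Illustrative luigi.cfg content. All paths are hypothetical placeholders.
SAMPLE_CFG = """
[spark]
spark-submit = /usr/bin/spark-submit
master = local[*]
jars = /opt/jars/postgresql.jar
driver-class-path = /opt/jars/postgresql.jar
"""


def read_spark_section(text):
    """Parse INI-style config text and return the [spark] options as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["spark"])


if __name__ == "__main__":
    for key, value in read_spark_section(SAMPLE_CFG).items():
        print(f"{key} = {value}")
```

Listing the JDBC driver JAR under both jars and driver-class-path is a common pattern for JDBC connections from Spark, but whether both are needed depends on the deployment.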

Sample config

Central Scheduler

The --local-scheduler flag should be used only during development. Once deployed, use the central scheduler.

Running the central scheduler:

$ luigid

When luigid starts up, it looks for the config file in the locations listed above.

Running a Luigi task with the central scheduler:

$ luigi --module pyspark_task PySparkTableSchema

Web Dashboard

Luigi comes with a web dashboard showing task history and statuses. For a local instance, the dashboard is available at http://localhost:8082. The central server's dashboard is available at http://<central-scheduler-host>:8082.
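
Before submitting tasks or opening the dashboard, it can help to confirm that something is listening on the scheduler port. The helper below is a hypothetical convenience, not part of Luigi; it only tests that a TCP connection can be opened and does not verify that the listener is actually luigid.

```python
import socket


def scheduler_reachable(host="localhost", port=8082, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    if scheduler_reachable():
        print("Scheduler port is open on localhost:8082")
    else:
        print("Nothing is listening on localhost:8082; is luigid running?")
```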