pyspark-template
A Python PySpark Project with Poetry
PySpark Template Project
People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.
Table of Contents
- PySpark Template Project
- Usage
- Inspiration
- Development
- Prerequisites
- Add format and lint code tools
- Autolint/Format code with Black in IDE:
- Check optional types with Mypy (PEP 484)
- Isort
- Fix
- Usage Local
- Use with poetry
- Usage in distributed-mode depending on your cluster manager type
Usage
Inspiration
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
- https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
- https://pawamoy.github.io/copier-poetry/
- https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
Development
Prerequisites
All you need is the following already installed:
- Git
- The project was tested with Python 3.10.9 managed by pyenv:
- pyenv prerequisites for ubuntu. Check the prerequisites for your OS.
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
  libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
  libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
- pyenv installed and available in the PATH: pyenv installation with Prerequisites
- JAVA_HOME environment variable configured with a Java JDK 11
- SPARK_HOME environment variable configured with the spark-3.3.1-bin-hadoop3 Spark package
- PYSPARK_PYTHON environment variable configured with "python3.10"
- PYSPARK_DRIVER_PYTHON environment variable configured with "python3.10" (a sketch of these exports follows this list)
- Install Make to run the Makefile
- Why Python 3.10? Because PySpark 3.3.1 doesn't seem to work with Python 3.11 at the moment (I haven't tried with Python 3.12)
- Install Python 3.10 with pyenv on homebrew/linuxbrew:
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
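The environment variables above can be set in your shell profile. A minimal sketch, assuming a JDK 11 at the usual Ubuntu path and Spark unpacked under /opt (adjust both paths to your machine):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/opt/spark-3.3.1-bin-hadoop3
export PYSPARK_PYTHON=python3.10
export PYSPARK_DRIVER_PYTHON=python3.10
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"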
Add format and lint code tools
Autolint/Format code with Black in IDE:
- Auto format via IDE: https://github.com/psf/black#pycharmintellij-idea
- [Optional] You could set up a pre-commit hook to enforce the Black format before commit: https://github.com/psf/black#version-control-integration (a config sketch follows this list)
- Or remember to run black . to apply the Black formatting rules to all sources before committing
- Add an integration with Jenkins so it complains and the tests fail if the Black format is not applied
- Add the same mypy options for vscode in Preferences: Open User Settings
- Use the option to lint/format with black and flake8 on editor save in vscode
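A minimal .pre-commit-config.yaml sketch for the pre-commit option above; the rev shown is an assumption, so pin it to the Black version the project actually uses:

repos:
  - repo: https://github.com/psf/black
    rev: 22.12.0  # assumed version; align with your Black dependency
    hooks:
      - id: black

Run pre-commit install once so the hook fires on every git commit.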
Check optional types with Mypy (PEP 484)
Configure Mypy to help annotate/hint types in Python code. It's very useful for IDEs and for catching errors/bugs early.
- Install the mypy plugin for IntelliJ
- Adjust the plugin with the following options:
"--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"
- Documentation: Type hints cheat sheet (Python 3)
- Add the same mypy options for vscode in Preferences: Open User Settings (a settings sketch follows)
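A sketch of those options in vscode's user settings JSON, assuming the classic Python extension linting keys (recent versions of the extension moved linting into separate extensions):

{
  "python.linting.mypyEnabled": true,
  "python.linting.mypyArgs": [
    "--follow-imports=silent",
    "--show-column-numbers",
    "--ignore-missing-imports",
    "--disallow-untyped-defs",
    "--check-untyped-defs"
  ]
}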
Isort
- isort is the default on pycharm
- isort with vscode
- Lint/format/sort imports on save with vscode in Preferences: Open User Settings:
{
"editor.formatOnSave": true,
"python.formatting.provider": "black",
"[python]": {
"editor.codeActionsOnSave": {
"source.organizeImports": true
}
}
}
- isort configuration for pycharm. See Set isort and black formatting code in pycharm
- You can use the make lint command to check the flake8/mypy rules and automatically apply the black and isort formatting to the code with the previous configuration, or run isort manually:
isort .
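If black and isort ever disagree over import formatting, the usual fix is isort's black profile; a sketch of the pyproject.toml stanza, assuming the project keeps its tool configuration there:

[tool.isort]
profile = "black"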
Fix
- Show a way to treat an erroneous JSON file like data/pubmed.json (one possible approach is sketched below)
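A sketch of one way to do it with Spark's permissive JSON reader; the exact layout of data/pubmed.json is an assumption, and _corrupt_record is Spark's default name for the corrupt-record column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pubmed-fix").getOrCreate()

# PERMISSIVE mode keeps unparseable rows instead of failing the whole read,
# storing their raw text in the corrupt-record column.
df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("data/pubmed.json")
)

# Spark refuses to filter a raw JSON read on the corrupt column alone,
# so cache the DataFrame first.
df = df.cache()

valid = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
broken = df.filter(df["_corrupt_record"].isNotNull())  # inspect/repair these rows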
Usage Local
- Create a poetry env with python 3.10
poetry env use 3.10
- Install dependencies in the poetry env (virtualenv)
make deps
- Lint & Test
make build
- Lint, Test & Run
make run
- Run dev
make dev
- Build binary/Python wheel
make dist
Use with poetry
poetry run drugs_gen --help
Usage: drugs_gen [OPTIONS]
Options:
-d, --drugs TEXT Path to drugs.csv
-p, --pubmed TEXT Path to pubmed.csv
-c, --clinicals_trials TEXT Path to clinical_trials.csv
-o, --output TEXT Output path to result.json (e.g
/path/to/result.json)
--help Show this message and exit.
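For example (the input and output paths are assumptions; point them at your own files):

poetry run drugs_gen -d data/drugs.csv -p data/pubmed.csv -c data/clinical_trials.csv -o output/result.json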
Usage in distributed-mode depending on your cluster manager type
- Use spark-submit with the Python wheel file built by make dist in the dist folder (see the sketch below).
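A sketch of such a submission; the master, the wheel file name, and the entry-point script are assumptions to adapt to your cluster manager and to the artifact make dist actually produces:

spark-submit \
  --master yarn \
  --py-files dist/pyspark_template-0.1.0-py3-none-any.whl \
  main.py --help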