pyspark-template
A Python PySpark Project with Poetry
PySpark Template Project
People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.
Table of Contents
- PySpark Template Project
- Usage
- Inspiration
- Development
- Prerequisites
- Add format and lint code tools
- Autolint/Format code with Black in IDE:
- Check optional types with Mypy (PEP 484)
- Isort
- Fix
- Usage Local
- Use with poetry
- Usage in distributed-mode depending on your cluster manager type
Usage
Inspiration
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
- https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
- https://pawamoy.github.io/copier-poetry/
- https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
Development
Prerequisites
All you need is the following already installed:
- Git
- The project was tested with Python 3.10.9 managed by pyenv:
- pyenv prerequisites for ubuntu. Check the prerequisites for your OS.
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
  libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
  libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
- pyenv installed and available in the PATH: pyenv installation with Prerequisites
- JAVA_HOME environment variable configured with a Java JDK 11
- SPARK_HOME environment variable configured with the spark-3.3.1-bin-hadoop3 Spark package
- PYSPARK_PYTHON environment variable configured with "python3.10"
- PYSPARK_DRIVER_PYTHON environment variable configured with "python3.10" (a sketch of these exports follows this list)
- Install Make to run the Makefile
- Why Python 3.10? Because PySpark 3.3.1 doesn't seem to work with Python 3.11 at the moment (I haven't tried with Python 3.12)
- Install Python 3.10 with pyenv on homebrew/linuxbrew:
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
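The environment variables above can be set in your shell profile. A minimal sketch, assuming a JDK 11 at the usual Ubuntu path and Spark unpacked under /opt (adjust both paths to your machine):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/opt/spark-3.3.1-bin-hadoop3
export PYSPARK_PYTHON=python3.10
export PYSPARK_DRIVER_PYTHON=python3.10
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"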
Add format and lint code tools
Autolint/Format code with Black in IDE:
- Auto format via IDE: https://github.com/psf/black#pycharmintellij-idea
- [Optional] You could set up a pre-commit hook to enforce the Black format before commit: https://github.com/psf/black#version-control-integration (a config sketch follows this list)
- Or remember to run black . to apply the Black formatting rules to all sources before committing
- Add an integration with Jenkins so it complains and the tests fail if the Black format is not applied
- Add the same mypy options for vscode in Preferences: Open User Settings
- Use the option to lint/format with black and flake8 on editor save in vscode
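A minimal .pre-commit-config.yaml sketch for the pre-commit option above; the rev shown is an assumption, so pin it to the Black version the project actually uses:

repos:
  - repo: https://github.com/psf/black
    rev: 22.12.0  # assumed version; align with your Black dependency
    hooks:
      - id: black

Run pre-commit install once so the hook fires on every git commit.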
Check optional types with Mypy (PEP 484)
Configure Mypy to help annotate/hint types in Python code. It's very useful for IDEs and for catching errors/bugs early.
- Install the mypy plugin for IntelliJ
- Adjust the plugin with the following options:
"--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"
- Documentation: Type hints cheat sheet (Python 3)
- Add the same mypy options for vscode in Preferences: Open User Settings (a settings sketch follows)
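A sketch of those options in vscode's user settings JSON, assuming the classic Python extension linting keys (recent versions of the extension moved linting into separate extensions):

{
  "python.linting.mypyEnabled": true,
  "python.linting.mypyArgs": [
    "--follow-imports=silent",
    "--show-column-numbers",
    "--ignore-missing-imports",
    "--disallow-untyped-defs",
    "--check-untyped-defs"
  ]
}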
Isort
- isort is the default on pycharm
- isort with vscode
- Lint/format/sort imports on save with vscode in Preferences: Open User Settings:
{
"editor.formatOnSave": true,
"python.formatting.provider": "black",
"[python]": {
"editor.codeActionsOnSave": {
"source.organizeImports": true
}
}
}
- isort configuration for pycharm. See Set isort and black formatting code in pycharm
- You can use the make lint command to check the flake8/mypy rules and automatically apply the black and isort formatting to the code with the previous configuration, or run isort manually:
isort .
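If black and isort ever disagree over import formatting, the usual fix is isort's black profile; a sketch of the pyproject.toml stanza, assuming the project keeps its tool configuration there:

[tool.isort]
profile = "black"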
Fix
- Show a way to treat an erroneous JSON file like data/pubmed.json (one possible approach is sketched below)
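A sketch of one way to do it with Spark's permissive JSON reader; the exact layout of data/pubmed.json is an assumption, and _corrupt_record is Spark's default name for the corrupt-record column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pubmed-fix").getOrCreate()

# PERMISSIVE mode keeps unparseable rows instead of failing the whole read,
# storing their raw text in the corrupt-record column.
df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("data/pubmed.json")
)

# Spark refuses to filter a raw JSON read on the corrupt column alone,
# so cache the DataFrame first.
df = df.cache()

valid = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
broken = df.filter(df["_corrupt_record"].isNotNull())  # inspect/repair these rows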
Usage Local
- Create a poetry env with python 3.10
poetry env use 3.10
- Install dependencies in the poetry env (virtualenv)
make deps
- Lint & Test
make build
- Lint, Test & Run
make run
- Run dev
make dev
- Build binary/Python wheel
make dist
Use with poetry
poetry run drugs_gen --help
Usage: drugs_gen [OPTIONS]
Options:
-d, --drugs TEXT Path to drugs.csv
-p, --pubmed TEXT Path to pubmed.csv
-c, --clinicals_trials TEXT Path to clinical_trials.csv
-o, --output TEXT Output path to result.json (e.g
/path/to/result.json)
--help Show this message and exit.
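For example (the input and output paths are assumptions; point them at your own files):

poetry run drugs_gen -d data/drugs.csv -p data/pubmed.csv -c data/clinical_trials.csv -o output/result.json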
Usage in distributed-mode depending on your cluster manager type
- Use spark-submit with the Python wheel file built by make dist in the dist folder (see the sketch below).
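A sketch of such a submission; the master, the wheel file name, and the entry-point script are assumptions to adapt to your cluster manager and to the artifact make dist actually produces:

spark-submit \
  --master yarn \
  --py-files dist/pyspark_template-0.1.0-py3-none-any.whl \
  main.py --help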