spark-esri icon indicating copy to clipboard operation
spark-esri copied to clipboard

Repo to demonstrate the usage of Apache Spark within a Jupyter notebook within ArcGIS Pro

Spark Esri

Project to demonstrate the usage of Apache Spark within a Jupyter notebook within ArcGIS Pro.

Notes

Oct 25, 2022 - Updated to support upcoming Pro 3.1. See SparkGeo2 notebook for integration with Apache Arrow :-)

Apr 12, 2022 - Running PySpark in Pro 2.9 requires the PYSPARK_PYTHON environment variable to be set. It should point to the python.exe executable of your active conda environment, e.g., C:\Users\%USERNAME%\AppData\Local\ESRI\conda\envs\spark_esri\python.exe. Defining CONDA_DEFAULT_ENV is neither sufficient and nor necesary.

Dec 16, 2021 - Added check for env var SPARK_HOME to override built-in spark. See instructions below.

Oct 30, 2021 - Pro 2.8 relies on the Windows registry to find the active conda environment. The registry key is HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv. The value of this key is used to set the required os environment variable PYSPARK_PYTHON for PySpark to work correctly in a Pro notebook.

As of this writing, the order to detect the active conda environment is as follows:

  • look for env var CONDA_DEFAULT_ENV.
  • look for %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt, in case of an older Pro version.
  • look for HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv.

Oct 27, 2021 - Pro 2.8.3 removed the reliance and existence of the file %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt. It now depend on env var CONDA_DEFAULT_ENV to determine the activate conda env.

~~Sep 16, 2021 - Perform the following as a patch for Pro 2.8.3~~

cd c:\
git clone https://github.com/kontext-tech/winutils

~~Define a system environment variable HADOOP_HOME with value C:\winutils\hadoop-3.3.0 and add to system variable PATH the %HADOOP_HOME%/bin value.~~

~~NOTE: This works in Pro 2.6 ONLY. There is a small "issue" with Pro 2.7 and pyarrow. The folks in Redlands have a fix that will be in 2.8 :-(~~

Installation

Install Spark (Optional).

If you do not wish to use Pro's built-in Spark, you can download and install Spark 3.x separately. For example, download spark-3.2.1-bin-hadoop3.2.tgz and set the environment variable SPARK_HOME to the folder where you extracted the archive. It's best to avoid spaces in the folder path.

Create a new Pro Conda Environment.

Start a Python Command Prompt:

Note: You might need to add proxy settings to .condarc located in C:\Program Files\ArcGIS\Pro\bin\Python.

conda config --set proxy_servers.http http://username:password@host:port
conda config --set proxy_servers.https https://username:password@host:port

The above will produce something like the below:

ssl_verify: true
proxy_servers:
  http: http://domainname\username:password@host:port
  https: http://domainname\username:password@host:port

Create a new conda environment:

proswap arcgispro-py3
conda remove --yes --all --name spark_esri
conda create --yes --name spark_esri --clone arcgispro-py3
proswap spark_esri

Optional:

pip install fsspec==2021.8.1 boto3==1.18.35 s3fs==0.4.2 pyarrow==1.0.1

conda install --yes -c esri -c conda-forge -c default^
    "numba=0.53.*"^
    "pandas=1.2.*"^
    "pyodbc=4.0.*"^
    "gcsfs=0.7.*"        

Install the Esri Spark module.

Note: You might need to install Git for Windows.

git clone https://github.com/mraad/spark-esri.git
cd spark-esri
python setup.py install

Spatial Binning Notebook

MicroPathing Notebook

Please note the usage of the range slider on the map to filter the micropaths between a user defined hour of day.

Virtual Gate Crossings Notebook

The following is the resulting crossing points and gates statistics.

Remote Execution on MS Azure Databricks Notebook

Predict Taxi Trip Durations, Map Taxi Trip Duration Errors Notebooks

TODO

  • Unify spark_esri and spark_dbconnect python modules.

References

  • https://github.com/kontext-tech/winutils
  • https://github.com/cdarlint/winutils
  • https://github.com/steveloughran/winutils
  • https://www.geeksforgeeks.org/check-if-two-given-line-segments-intersect/
  • https://www.kite.com/python/answers/how-to-check-if-two-line-segments-intersect-in-python
  • https://pandas.pydata.org/pandas-docs/stable/development/extending.html
  • https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
  • https://www.esri.com/arcgis-blog/products/arcgis-pro/health/use-proximity-tracing-to-identify-possible-contact-events/
  • https://marinecadastre.gov/ais/
  • https://www.movable-type.co.uk/scripts/latlong.html
  • https://www.kaggle.com/c/nyc-taxi-trip-duration/data
  • https://developers.google.com/maps/documentation/utilities/polylinealgorithm
  • https://nvidia.github.io/spark-rapids
  • https://github.com/nvidia/spark-rapids
  • https://github.com/quantopian/qgrid
  • https://gist.github.com/rkaneko/dd2fae35149a29405d5e287ccd62677f Put parquet file on MinIO (S3 compatible storage) using pyarrow and s3fs
  • https://towardsdatascience.com/installing-apache-pyspark-on-windows-10-f5f0c506bea1