Spark Esri
Project to demonstrate the usage of Apache Spark within a Jupyter notebook within ArcGIS Pro.
Notes
Oct 25, 2022 - Updated to support upcoming Pro 3.1. See SparkGeo2 notebook for integration with Apache Arrow :-)
Apr 12, 2022 - Running PySpark in Pro 2.9 requires the PYSPARK_PYTHON environment variable to be set. It should point to the python.exe executable of your active conda environment, e.g., C:\Users\%USERNAME%\AppData\Local\ESRI\conda\envs\spark_esri\python.exe. Defining CONDA_DEFAULT_ENV is neither sufficient nor necessary.
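One way to satisfy this from inside a notebook is to set the variable before the Spark session is created. The sketch below is a minimal illustration, not this module's own code; `set_pyspark_python` and its parameters are hypothetical names.

```python
import os


def set_pyspark_python(env_dir, environ=None):
    """Point PYSPARK_PYTHON at python.exe inside a conda environment folder."""
    # Hypothetical helper: `environ` defaults to os.environ, but a plain dict
    # can be passed to try the function without touching the real environment.
    environ = os.environ if environ is None else environ
    python_exe = os.path.join(env_dir, "python.exe")
    environ["PYSPARK_PYTHON"] = python_exe
    return python_exe
```

Call it with the folder of the active environment (for example the spark_esri folder under %LOCALAPPDATA%\ESRI\conda\envs) before starting Spark.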
Dec 16, 2021 - Added check for env var SPARK_HOME to override the built-in Spark. See instructions below.
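The override can be pictured as a simple precedence rule: use SPARK_HOME when it is set, otherwise fall back to the Spark that ships with Pro. A minimal sketch under that assumption follows; `resolve_spark_home` is a hypothetical name, and the built-in location is passed in rather than guessed.

```python
import os


def resolve_spark_home(builtin_spark_home, environ=None):
    """Return SPARK_HOME if it is set and non-empty, else the built-in location."""
    # Assumption: an empty or whitespace-only value counts as "not set".
    environ = os.environ if environ is None else environ
    override = environ.get("SPARK_HOME", "").strip()
    return override or builtin_spark_home
```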
Oct 30, 2021 - Pro 2.8 relies on the Windows registry to find the active conda environment. The registry key is HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv. The value of this key is used to set the required OS environment variable PYSPARK_PYTHON for PySpark to work correctly in a Pro notebook.
As of this writing, the active conda environment is detected in the following order:
- look for env var CONDA_DEFAULT_ENV.
- look for %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt, in case of an older Pro version.
- look for HKEY_CURRENT_USER/SOFTWARE/ESRI/ArcGISPro/PythonCondaEnv.
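The detection order above can be sketched as a chain of fallbacks. This is an illustration under stated assumptions, not the module's actual implementation: the function names are hypothetical, the content of proenv.txt is returned as-is, and PythonCondaEnv is assumed to be a value stored under the SOFTWARE\ESRI\ArcGISPro key.

```python
import os
import sys


def read_proenv_file(local_app_data):
    """Return the stripped content of proenv.txt, or None if it is unavailable."""
    if not local_app_data:
        return None
    path = os.path.join(local_app_data, "ESRI", "conda", "envs", "proenv.txt")
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip() or None
    except OSError:
        return None


def read_registry_env():
    """Return the PythonCondaEnv registry value, or None (always None off Windows)."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard-library module
    try:
        # Assumption: PythonCondaEnv is a value under the ArcGISPro key.
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"SOFTWARE\ESRI\ArcGISPro") as key:
            value, _ = winreg.QueryValueEx(key, "PythonCondaEnv")
            return value or None
    except OSError:
        return None


def detect_conda_env(environ=None, proenv_reader=read_proenv_file,
                     registry_reader=read_registry_env):
    """Try CONDA_DEFAULT_ENV, then proenv.txt, then the registry, in that order."""
    environ = os.environ if environ is None else environ
    env = environ.get("CONDA_DEFAULT_ENV")
    if env:
        return env
    env = proenv_reader(environ.get("LOCALAPPDATA"))
    if env:
        return env
    return registry_reader()
```

The readers are passed in as parameters so that each step can be exercised on its own, including on a machine without Pro installed.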
Oct 27, 2021 - Pro 2.8.3 removed the file %LOCALAPPDATA%/ESRI/conda/envs/proenv.txt and no longer relies on it. It now depends on the env var CONDA_DEFAULT_ENV to determine the active conda env.
~~Sep 16, 2021 - Perform the following as a patch for Pro 2.8.3~~
cd c:\
git clone https://github.com/kontext-tech/winutils
~~Define a system environment variable HADOOP_HOME with value C:\winutils\hadoop-3.3.0 and add the %HADOOP_HOME%/bin value to the system variable PATH.~~
~~NOTE: This works in Pro 2.6 ONLY. There is a small "issue" with Pro 2.7 and pyarrow. The folks in Redlands have a fix that will be in 2.8 :-(~~
Installation
Install Spark (Optional).
If you do not wish to use Pro's built-in Spark, you can download and install Spark 3.x separately. For example, download spark-3.2.1-bin-hadoop3.2.tgz and set the environment variable SPARK_HOME to the folder where you extracted the archive. It's best to avoid spaces in the folder path.
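A quick sanity check of the SPARK_HOME value can catch the common mistakes mentioned here before Spark is started. The sketch below is only an illustration; `check_spark_home` is a hypothetical helper and the checks are limited to what this section describes (variable set, no spaces, folder exists).

```python
import os


def check_spark_home(spark_home):
    """Return a list of human-readable problems found with a SPARK_HOME value."""
    problems = []
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    if " " in spark_home:
        problems.append("path contains spaces")
    if not os.path.isdir(spark_home):
        problems.append("folder does not exist")
    return problems
```

For example, `check_spark_home(os.environ.get("SPARK_HOME"))` returns an empty list when none of these problems is found.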
Create a new Pro Conda Environment.
Start a Python Command Prompt:
Note: You might need to add proxy settings to .condarc located in C:\Program Files\ArcGIS\Pro\bin\Python.
conda config --set proxy_servers.http http://username:password@host:port
conda config --set proxy_servers.https https://username:password@host:port
The above will produce something like the following:
ssl_verify: true
proxy_servers:
  http: http://domainname\username:password@host:port
  https: http://domainname\username:password@host:port
Create a new conda environment:
proswap arcgispro-py3
conda remove --yes --all --name spark_esri
conda create --yes --name spark_esri --clone arcgispro-py3
proswap spark_esri
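To confirm that the swap took effect, one option is to inspect the interpreter prefix from the Python Command Prompt. This sketch assumes the environment folder is named after the environment, which is the usual conda layout; `active_env_name` is a hypothetical helper.

```python
import os
import sys


def active_env_name(prefix=None):
    """Return the last path component of the interpreter prefix."""
    # Assumption: the environment folder name matches the environment name.
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix))
```

Under that assumption, `active_env_name()` should return spark_esri after `proswap spark_esri`.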
Optional:
pip install fsspec==2021.8.1 boto3==1.18.35 s3fs==0.4.2 pyarrow==1.0.1
conda install --yes -c esri -c conda-forge -c defaults ^
  "numba=0.53.*" ^
  "pandas=1.2.*" ^
  "pyodbc=4.0.*" ^
  "gcsfs=0.7.*"
Install the Esri Spark module.
Note: You might need to install Git for Windows.
git clone https://github.com/mraad/spark-esri.git
cd spark-esri
python setup.py install
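After the installation steps, a quick import check can show whether the packages landed in the active environment. This is a minimal sketch; `missing_packages` is a hypothetical helper, and the module names in the example are taken from the install commands above.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means every module was found in the active environment.
    print(missing_packages(["pyarrow", "pandas", "numba", "spark_esri"]))
```

The check only locates the modules without importing them, so it stays fast even for heavy packages.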
Spatial Binning Notebook
MicroPathing Notebook
Note the range slider on the map, which filters the micropaths between user-defined hours of the day.
Virtual Gate Crossings Notebook
The following are the resulting crossing points and gate statistics.
Remote Execution on MS Azure Databricks Notebook
Predict Taxi Trip Durations, Map Taxi Trip Duration Errors Notebooks
TODO
- Unify spark_esri and spark_dbconnect python modules.
References
- https://github.com/kontext-tech/winutils
- https://github.com/cdarlint/winutils
- https://github.com/steveloughran/winutils
- https://www.geeksforgeeks.org/check-if-two-given-line-segments-intersect/
- https://www.kite.com/python/answers/how-to-check-if-two-line-segments-intersect-in-python
- https://pandas.pydata.org/pandas-docs/stable/development/extending.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
- https://www.esri.com/arcgis-blog/products/arcgis-pro/health/use-proximity-tracing-to-identify-possible-contact-events/
- https://marinecadastre.gov/ais/
- https://www.movable-type.co.uk/scripts/latlong.html
- https://www.kaggle.com/c/nyc-taxi-trip-duration/data
- https://developers.google.com/maps/documentation/utilities/polylinealgorithm
- https://nvidia.github.io/spark-rapids
- https://github.com/nvidia/spark-rapids
- https://github.com/quantopian/qgrid
- https://gist.github.com/rkaneko/dd2fae35149a29405d5e287ccd62677f (Put a Parquet file on MinIO, an S3-compatible storage, using pyarrow and s3fs)
- https://towardsdatascience.com/installing-apache-pyspark-on-windows-10-f5f0c506bea1