spark-privacy-preserver
How to run it?
Hi @ThaminduR! Thank you for your work here!
I'm trying to reproduce the examples in the jupyter/pyspark-notebook:spark-2 Docker container with PySpark 2.4.5 and Python 3.7.6 (as required in the README), but have had no success. I tried many things to get it running, but I keep getting errors.
Is there a step-by-step guide or a Docker container for testing the code?
What I did:
# Run container
docker run --rm -it --entrypoint /bin/bash jupyter/pyspark-notebook:spark-2
apt update && apt install -y git vim
pip install -U pip
# Install dependencies manually (version specifiers are quoted so the shell does not treat ">" as a redirect)
pip install -U "pandas>=1.1" pyarrow diffprivlib==0.2.1 tabulate==0.8.7 "mypy>=0.770" kmodes
# Install `spark-privacy-preserver`
git clone https://github.com/ThaminduR/spark-privacy-preserver
cd spark-privacy-preserver
pip install --no-deps .
pyspark
# Run the code from mondrian_preserver demo.ipynb
The line:
dfn = Preserver.k_anonymize(df, k, feature_columns, sensitive_column, categorical, schema)
dfn.show()
Output:
ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 13) java.lang.IllegalArgumentException
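To help narrow this down, here is a minimal sketch for reporting which package versions are actually installed in the container. The helper name `installed_versions` is my own and is not part of spark-privacy-preserver. It assumes each package exposes a `__version__` attribute, which may not hold for every one of them.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its version string.

    The value is None if the module cannot be imported, and "unknown"
    if it imports but has no __version__ attribute.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Packages from the manual install step above, plus pyspark itself.
packages = ["pyspark", "pandas", "pyarrow", "diffprivlib", "tabulate", "kmodes"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Running this inside the `pyspark` shell (or with `python` in the same container) should show whether the installed versions match the ones the README asks for.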
Could you please help with setting up the environment and running the code? Thanks!