coffee_boat
☕⛵WIP PySpark dependency management
https://github.com/pantsbuild/pex
For now we leave it out since many providers don't support it yet. Long story, buy me a :coffee:.
Right now we do some terrible things with overriding PYTHONPATH, which is great and works in the general case. If the Spark+K8s folks end up integrating better first party...
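To make the path override a bit more concrete, here is a minimal sketch of prepending a directory to the PYTHONPATH environment variable. This is not the code coffee_boat actually runs; `prepend_pythonpath` and the example paths are hypothetical names made up for illustration.

```python
import os


def prepend_pythonpath(env, new_dir):
    """Return a copy of env with new_dir placed first on PYTHONPATH.

    Hypothetical helper for illustration only, not part of coffee_boat.
    """
    updated = dict(env)
    existing = updated.get("PYTHONPATH", "")
    # Drop empty entries and any earlier copy of new_dir so it appears once.
    kept = [p for p in existing.split(os.pathsep) if p and p != new_dir]
    updated["PYTHONPATH"] = os.pathsep.join([new_dir] + kept)
    return updated


# Assumed example paths; a real environment directory would differ.
example_env = prepend_pythonpath(
    {"PYTHONPATH": os.pathsep.join(["/opt/a", "/opt/b"])},
    "/tmp/coffee_env/lib",
)
print(example_env["PYTHONPATH"])
```

Putting the new directory first means its packages shadow anything the host already provides, which is both the point of the override and the reason it is a bit terrible.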
We currently have one example notebook. It would be good to update the example to distribute PyArrow, since this will be useful in Spark 2.3+ for vectorized UDF users.
In theory, most of what we do goes through Spark's add-file mechanism, which should be handled, but I'm less certain about the decompressed directory. We should investigate this.
Right now we create a bunch of temp files but don't really clean them up. There is a flag to do part of this but it needs to be tested...