sparkmagic
Send data from local to Spark when using IPython kernel
Is your feature request related to a problem? Please describe.
It is not possible to send data from local to Spark when using the IPython kernel (as opposed to the PySpark kernel).
Describe the solution you'd like
Similar functionality to %%send_to_spark in the PySpark kernel, but for the IPython kernel.
Same question. How do I send local Python variables to the Livy PySpark session when running an IPython kernel?
The IPython magics are defined in this file: https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/magics/remotesparkmagics.py
It does not look like an equivalent of send_to_spark has been implemented there.
How could the following be added to remotesparkmagics.py?
https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/kernels/kernelmagics.py#L177
which refers to:
https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/magics/sparkmagicsbase.py#L51
My first guess is that it would be added as a subcommand to %%spark in remotesparkmagics.py with an additional @magic_arguments argument, perhaps @argument("-v", "--variable", type=str, default=None, help="Local variable to send to the remote PySpark session."), and then handled in another elif block like elif subcommand == "send_to_spark":, similar to the following existing subcommand (a rough sketch follows below):
https://github.com/jupyter-incubator/sparkmagic/blob/bfabbb39a0249197c2c05c8efe681710fff9151b/sparkmagic/sparkmagic/magics/remotesparkmagics.py#L160
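If anyone wants to attempt it, a very rough, untested fragment of what the change to remotesparkmagics.py might look like is below. It is not runnable as-is; the argument names, the subcommand string, and do_send_to_spark (and whatever arguments it actually takes) are assumptions based on the kernel-magic and base-class code linked above, so the details may well differ.

# Untested sketch of possible additions to remotesparkmagics.py.
# New arguments on the %%spark magic, alongside its existing @argument decorators:
@argument("-v", "--variable", type=str, default=None,
          help="Local variable to send to the remote PySpark session.")
@argument("-t", "--vartype", type=str, default="str",
          help="Type of the local variable, e.g. str or df (assumed, mirrors %%send_to_spark).")

# New branch inside the spark() method, next to the existing subcommand handling:
elif subcommand == "send_to_spark":
    # Assumed to delegate to the base-class helper linked above (sparkmagicsbase.py);
    # the real method name and argument list may differ.
    self.do_send_to_spark(cell, args.variable, args.vartype)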
Until one of us can complete a pull request with the fix, you can use the work-around below.
Not as nice as a magic, but it works.
import json, requests

# Livy endpoint (replace with your Livy host and port)
host = 'http://000.000.000.000:8998'
headers = {'Content-Type': 'application/json'}

# Find the id of the first running Livy session
sessions_url = f"{host}/sessions"
r1 = requests.get(sessions_url, headers=headers)
session_id = r1.json().get('sessions')[0].get('id')
statements_url = f"{sessions_url}/{session_id}/statements"

# Build a statement that recreates the local variable in the remote session
my_var = "test string to send"
var_name = "my_var"
var_val = repr(my_var)
pyspark_code = u'{} = {}'.format(var_name, var_val)

# Submit the statement to the remote session
r2 = requests.post(statements_url, data=json.dumps({'code': pyspark_code}), headers=headers)
r2.json()
Then check from a %%spark cell.
%%spark
my_var
output:
'test string to send'
Whereas before sending it via the POST request, the output would have been:
An error was encountered:
name 'my_var' is not defined
Traceback (most recent call last):
NameError: name 'my_var' is not defined
Any solutions for sending a pandas DataFrame?
Clunky, and I'm not sure if it'd work, but if you can send a string, you should be able to convert the DataFrame to JSON and then send the JSON output as a string. You can then reverse the process on the other side, along the lines of the sketch below. Again, clunky, and you wouldn't want to do it for anything of substance.
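For what it's worth, here is a sketch of that idea built on the same Livy /statements work-around above. It reuses statements_url and headers from that snippet, assumes pandas is available on the remote driver, and the variable names (local_df, remote_df) are just placeholders.

import json
import pandas as pd
import requests

# Local DataFrame to ship to the remote session
local_df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Serialize it to JSON locally, then build remote code that rebuilds it from that string
df_json = local_df.to_json(orient='records')
pyspark_code = (
    "import io\n"
    "import pandas as pd\n"
    "remote_df = pd.read_json(io.StringIO({}), orient='records')".format(repr(df_json))
)

# statements_url and headers as defined in the work-around above
r = requests.post(statements_url, data=json.dumps({'code': pyspark_code}), headers=headers)
r.json()

From a %%spark cell, remote_df would then be a pandas DataFrame on the driver, and spark.createDataFrame(remote_df) would turn it into a Spark DataFrame if needed.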