koalas
Apply causes
I've been trying to use `apply` to modify a column in a DataFrame. This used to work with previous versions. For example,

```python
df['col1'] = df['col1'].apply(lambda x: x)
df.head()
```

causes:
```
An error occurred while calling o7973.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
	at org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:874)
	at org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:870)
	at sun.reflect.GeneratedMethodAccessor331.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 169.0 failed 4 times, most recent failure: Lost task 4.3 in stage 169.0 (TID 2036, ip-10-0-0-200.ec2.internal, executor 2): java.lang.IllegalArgumentException
```
Hi @kaleyneufeld, Could you share the reproducible example so that we can investigate? Thanks!
Just a gut feeling, but I think it's because `ARROW_PRE_0_15_IPC_FORMAT` is not set (see also https://github.com/databricks/koalas#getting-started).
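For reference, the suggestion above can be sketched as follows. This is a minimal sketch, not the thread's exact setup: the commented app name and session-creation lines are illustrative assumptions.

```python
import os

# ARROW_PRE_0_15_IPC_FORMAT=1 tells PyArrow >= 0.15 to fall back to the
# pre-0.15 Arrow IPC stream format that Spark 2.x's Arrow integration expects.
# It has to be set before Spark (the JVM and its Python workers) starts, so
# put it at the very top of the script, before any Spark-related imports.
os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'

# Only after that, create the session and import Koalas, e.g. (illustrative):
# from pyspark.sql import SparkSession
# import databricks.koalas as ks
# spark = SparkSession.builder.appName('koalas-apply-repro').getOrCreate()
```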
@ueshin here is some example code:

```python
df = ks.DataFrame([1, 2, 3], columns=['col1'])
df.head()
df['col1'] = df['col1'].apply(lambda x: x)
df.head()
```
@HyukjinKwon I just tried setting `ARROW_PRE_0_15_IPC_FORMAT` to 1 and still got the same error message.
@kaleyneufeld Thanks for sharing it.
I tried it with Koalas 1.2.0, 1.3.0, and the latest master, but it works as expected.
What versions of Koalas, Spark, and PyArrow are you using?
@ueshin I'm using Spark 2.4.5, Koalas 1.3.0, and PyArrow 2.0.0
In that case, I'd also suspect `ARROW_PRE_0_15_IPC_FORMAT` is not set properly. Could you try:

```python
import os
os.environ.get('ARROW_PRE_0_15_IPC_FORMAT', 'None')
```

and

```python
from pyspark.sql.functions import udf

@udf('string')
def check(x):
    return os.environ.get('ARROW_PRE_0_15_IPC_FORMAT', 'None')

spark.range(1).select(check('id')).head()[0]
```

Both should return '1'.
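If the second check (the one that runs inside a UDF) does not return '1', the variable is likely missing on the executors even though it is set in the driver's environment. One way to propagate it, assuming a spark-submit deployment with a hypothetical `app.py`, is Spark's per-executor environment config:

```shell
# spark.executorEnv.* sets environment variables for the executor processes,
# which is where the Python UDF workers (and hence PyArrow) actually run.
spark-submit \
  --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  app.py

# In YARN cluster mode the driver also runs remotely, so it would
# additionally need:
#   --conf spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1
```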
Also, what happens with a PyArrow older than 0.15.0, such as 0.14.1?
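Trying an older PyArrow, as suggested above, would rule out the Arrow 0.15 IPC format change entirely; a downgrade could look like this (the version pin comes from the thread):

```shell
# Downgrade PyArrow below 0.15 so the ARROW_PRE_0_15_IPC_FORMAT
# workaround is not needed at all, then restart the Spark session.
pip install 'pyarrow==0.14.1'
```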