
Apply causes

Open kaleyneufeld opened this issue 4 years ago • 7 comments

I've been trying to use 'apply' to modify a column in a dataframe. This used to work with previous versions. For example,

df['col1'] = df['col1'].apply(lambda x: x)
df.head()

causes

An error occurred while calling o7973.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
	at org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:874)
	at org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:870)
	at sun.reflect.GeneratedMethodAccessor331.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 169.0 failed 4 times, most recent failure: Lost task 4.3 in stage 169.0 (TID 2036, ip-10-0-0-200.ec2.internal, executor 2): java.lang.IllegalArgumentException

kaleyneufeld avatar Oct 20 '20 19:10 kaleyneufeld

Hi @kaleyneufeld, could you share a reproducible example so that we can investigate? Thanks!

ueshin avatar Oct 20 '20 20:10 ueshin

Just my gut feeling, but I think it's because ARROW_PRE_0_15_IPC_FORMAT is not set (see also https://github.com/databricks/koalas#getting-started)

HyukjinKwon avatar Oct 21 '20 00:10 HyukjinKwon
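HyukjinKwon's suggestion hinges on *when* the variable is set: it has to be in the environment before the JVM and the Python workers start, or the executors never see it. A minimal sketch of setting it early (the `spark.executorEnv.*` propagation is standard Spark configuration, not Koalas-specific; the session setup is illustrative):

```python
import os

# Must be set before the JVM and the Python workers launch; setting it
# after the SparkSession already exists has no effect on running executors.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Illustrative session setup: spark.executorEnv.<NAME> forwards an
# environment variable to the executor processes as well.
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
#          .getOrCreate())

print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])  # -> 1
```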

@ueshin here is some example code:

import databricks.koalas as ks

df = ks.DataFrame([1, 2, 3], columns=['col1'])
df.head()
df['col1'] = df['col1'].apply(lambda x: x)
df.head()

@HyukjinKwon I just tried setting ARROW_PRE_0_15_IPC_FORMAT to 1 and still got the same error message

kaleyneufeld avatar Oct 21 '20 14:10 kaleyneufeld

@kaleyneufeld Thanks for sharing it. I tried it with Koalas 1.2.0, 1.3.0, and the latest master, but it's working as expected.

What are the versions of Koalas, Spark, and PyArrow?

ueshin avatar Oct 21 '20 21:10 ueshin

@ueshin I'm using Spark 2.4.5, Koalas 1.3.0, and PyArrow 2.0.0

kaleyneufeld avatar Oct 22 '20 15:10 kaleyneufeld

In that case, I'd also suspect ARROW_PRE_0_15_IPC_FORMAT is not set properly.

Could you try:

import os
os.environ.get('ARROW_PRE_0_15_IPC_FORMAT', 'None')

and

from pyspark.sql.functions import udf

@udf('string')
def check(x):
  return os.environ.get('ARROW_PRE_0_15_IPC_FORMAT', 'None')

spark.range(1).select(check('id')).head()[0]

Both should return '1'.

ueshin avatar Oct 22 '20 17:10 ueshin

Also, what happens with a PyArrow older than 0.15.0, like 0.14.1?

ueshin avatar Oct 22 '20 17:10 ueshin
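The version boundary behind that question can be spelled out with a small helper (hypothetical, not part of Koalas or PyArrow): PyArrow 0.15.0 changed its IPC stream format, so any version at or above it needs the compatibility flag to interoperate with Spark 2.4.

```python
def needs_legacy_ipc_flag(version: str) -> bool:
    """True if this PyArrow version uses the new (>= 0.15) IPC format,
    i.e. Spark 2.4 needs ARROW_PRE_0_15_IPC_FORMAT=1 to talk to it."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (0, 15)

print(needs_legacy_ipc_flag("2.0.0"))   # the reporter's version -> True
print(needs_legacy_ipc_flag("0.14.1"))  # the suggested downgrade -> False
```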