to_td() fails with null-only column
If a column contains only null values, to_td() fails to upload the DataFrame. The failure cannot be avoided even by setting spark.sql.execution.arrow.fallback.enabled=true in the PySpark config.
>>> import pytd.pandas_td as td
>>> engine = td.create_engine('presto:sample_datasets')
>>> df = td.read_td('select * from www_access limit 100', engine)
>>> df.isnull().sum()
user       100
host         0
path         0
referer      0
code         0
agent        0
size         0
method       0
dtype: int64
>>> con = td.connect()
>>> df.drop(columns='time', inplace=True)
>>> td.to_td(df, 'aki.test_pytd', con, if_exists='replace', index=False)
19/05/09 17:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/pandas_td/__init__.py", line 349, in to_td
    writer.write_dataframe(frame, con.database, name, mode)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/writer.py", line 88, in write_dataframe
    sdf = self.td_spark.createDataFrame(df)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 350, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
It looks like Arrow is not actually the problem; everything happens in the Spark code here: https://github.com/apache/spark/blob/d36cce18e262dc9cbd687ef42f8b67a62f0a3e22/python/pyspark/sql/session.py#L619-L787
createDataFrame
-> _createFromLocal
-> _inferSchemaFromList
-> _has_nulltype returns True, so _inferSchemaFromList raises the ValueError
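The root cause is reproducible with plain PySpark, with no pytd or Treasure Data involved at all. A minimal sketch, assuming a local Spark 2.4 session (the Arrow fallback path lands in the same inference code):

>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master('local[1]').getOrCreate()
>>> # An object-dtype column holding only None gives Spark nothing to infer from
>>> pdf = pd.DataFrame({'user': [None, None], 'code': [200, 404]})
>>> spark.createDataFrame(pdf)
ValueError: Some of types cannot be determined after inferring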
pytd may need to validate column values before calling createDataFrame.
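Until then, a user-side workaround is to drop the null-only columns before calling to_td(); continuing the session above:

>>> df = df.dropna(axis=1, how='all')  # drop columns that contain only nulls
>>> td.to_td(df, 'aki.test_pytd', con, if_exists='replace', index=False)

On the pytd side, the validation could be as small as the following sketch (a hypothetical helper, not pytd's actual code), called at the top of write_dataframe:

def _check_nulltype_columns(df):
    # Spark cannot infer a type for a column holding nothing but nulls,
    # so fail fast with a message that names the offending columns.
    null_only = [c for c in df.columns if df[c].isnull().all()]
    if null_only:
        raise ValueError(
            'columns with only null values cannot be type-inferred '
            'by Spark: %r' % null_only)

Alternatively, the writer could cast such columns to an explicit dtype (or build an explicit Spark schema) instead of rejecting them.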