pytd

to_td() fails with null only column

Open chezou opened this issue 6 years ago • 2 comments

If a column contains only null values, to_td() fails to upload the dataframe. Setting spark.sql.execution.arrow.fallback.enabled=true in the PySpark config does not avoid the failure.

>>> import pytd.pandas_td as td
>>> engine = td.create_engine('presto:sample_datasets')
>>> df = td.read_td('select * from www_access limit 100', engine)
>>> df.isnull().sum()
user       100
host         0
path         0
referer      0
code         0
agent        0
size         0
method       0
dtype: int64
>>> con = td.connect()
>>> df.drop(columns='time', inplace=True)
>>> td.to_td(df, 'aki.test_pytd', con, if_exists='replace', index=False)
19/05/09 17:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/pandas_td/__init__.py", line 349, in to_td
    writer.write_dataframe(frame, con.database, name, mode)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/writer.py", line 88, in write_dataframe
    sdf = self.td_spark.createDataFrame(df)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 350, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
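Until this is handled inside pytd, one possible client-side workaround (a sketch, not part of the pytd API) is to drop columns that are entirely null before calling to_td(), since Spark cannot infer a type for them:

```python
import pandas as pd

# Hypothetical workaround: remove columns whose values are all null before
# handing the frame to to_td(). dropna(axis=1, how="all") drops exactly the
# columns where every value is null.
df = pd.DataFrame({"user": [None, None], "code": [200, 404]})
clean = df.dropna(axis=1, how="all")
# "user" is gone; only "code" remains, so Spark schema inference can succeed.
```

This loses the null-only column on the Treasure Data side, so it only fits cases where that column carries no information anyway.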

chezou avatar May 09 '19 08:05 chezou

Looks like this is unrelated to Arrow. Everything happens in the Spark code at: https://github.com/apache/spark/blob/d36cce18e262dc9cbd687ef42f8b67a62f0a3e22/python/pyspark/sql/session.py#L619-L787

createDataFrame -> _createFromLocal -> _inferSchemaFromList; _has_nulltype returns True and the exception is raised

takuti avatar May 09 '19 09:05 takuti

pytd may need to validate column values before calling createDataFrame
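A minimal sketch of what such a validation could look like (function name and message are hypothetical, not existing pytd code): check for null-only columns up front and fail with a descriptive error instead of Spark's opaque "Some of types cannot be determined after inferring":

```python
import pandas as pd

def validate_no_null_only_columns(df):
    """Hypothetical pre-flight check: raise a clear error if any column
    contains only null values, since Spark cannot infer a type for it."""
    null_only = [c for c in df.columns if df[c].isnull().all()]
    if null_only:
        raise ValueError(
            "Cannot infer Spark types for null-only columns: %s" % null_only
        )

# Reproducing the report: "user" is entirely null.
frame = pd.DataFrame({"user": [None, None], "code": [200, 404]})
try:
    validate_no_null_only_columns(frame)
except ValueError as e:
    caught = str(e)
```

Calling this at the top of write_dataframe would surface the offending column names to the user rather than a schema-inference traceback.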

takuti avatar May 09 '19 09:05 takuti